Migrating a production server


#1

We set up new servers with ofn-install.
And that’s cool. But we don’t have any scripts to help with moving from one
server to another without downtime. This is my story of moving the Australian
production instance with three minutes of downtime. It has taken me 15 hours
so far, and there are still some outstanding pull requests and some
documentation to update.

First, I created a to-do list. I do not claim that it is complete.
I tried to create a process that allows steps to be undone where possible.
The big exception is the database switch.
Assuming that data is constantly changing, we can’t just go back to the
previous database or a backup without losing the most recent data.
I did not set up master-master replication at the database level due to lack
of experience and to limit the time I’m spending on this.

My checklist starts after I successfully installed openfoodnetwork via
ofn-install on a new server called prod2.openfoodnetwork.org.au.
I chose a new unique name which makes it possible to use the Ansible scripts
and Letsencrypt independently of the old server.

This is just the first time doing all of this with an OFN server.
I am aware that I took some shortcuts and there is more work to do in
ofn-install to make this easier.
I hope we can refine the process and simplify it every time we have to repeat
it.

Checklist

Long term preparation:

  • [ ] change DNS TTL to 5 minutes
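
Lowering the TTL only helps once resolvers have actually picked up the new value, so it can be worth checking before the switch. A minimal sketch, assuming `dig` is installed; `ttl_from_answer` is a hypothetical helper name, not part of ofn-install. Note that a recursive resolver reports the remaining cache time, which counts down, rather than the configured TTL.

```shell
#!/bin/sh
# Hypothetical helper: read `dig +noall +answer` output on stdin and
# print the TTL (second column) of the first answer record.
ttl_from_answer() {
  awk 'NR == 1 { print $2 }'
}

# Example (needs network access):
#   dig +noall +answer openfoodnetwork.org.au A | ttl_from_answer
```

A value of 300 or less means the five-minute TTL is in effect for that resolver.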

Configure old server:

  • [ ] set up firewall
    # set up ufw with all rules, then:
    ufw allow from 43.239.97.146 to any port postgresql
    
  • [ ] allow postgresql to receive connections from the new server
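
Letting PostgreSQL accept connections from the new server usually takes two changes on the old server: a `listen_addresses` setting in postgresql.conf and a `host` line in pg_hba.conf. A minimal sketch of the second part, assuming the database and user names from the copy script further down and md5 password authentication. The helper name `pg_hba_line` and the file path are assumptions; the real path depends on the PostgreSQL version.

```shell
#!/bin/sh
# Hypothetical helper: print a pg_hba.conf line that lets a single client
# IP connect to one database as one user with md5 password authentication.
pg_hba_line() {
  db="$1"; user="$2"; ip="$3"
  printf 'host  %s  %s  %s/32  md5\n' "$db" "$user" "$ip"
}

# Example (run on the old server; <version> is a placeholder):
#   pg_hba_line openfoodweb_production openfoodweb 43.239.97.146 \
#     | sudo tee -a /etc/postgresql/<version>/main/pg_hba.conf
#   sudo service postgresql reload
# postgresql.conf must also listen on a public interface, for example
# listen_addresses = '*'; changing that setting needs a restart, not a reload.
```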

Prepare new server:

  • [ ] deactivate git post-receive hook for deployments

    # Tell other devs to not deploy to the old server:
    echo "[ERROR] Aborting deploy!"
    echo "[Thu 8 Feb 2018] Maikel is about to switch to the new production server."
    exit 1
    
  • [ ] customise deploy script (post-receive hook)

  • [ ] update post-receive hook so it does not install cronjobs for now (do not execute whenever)

  • [ ] add logrotate to deploy script (post-receive), see GH issue

    # Rotate log files
    logrotate -s "$CURRENT_PATH/log/logrotate-status" "$CURRENT_PATH/log/logrotate.conf"
    
  • [ ] copy and adapt initializers and config

    #!/bin/sh
    
    rsync -avz --delete ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/public/system/ apps/openfoodnetwork/current/public/system/
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/public/assets/ofn_logo_black.png apps/openfoodnetwork/current/public/assets/ofn_logo_black.png
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/shared/config/bugsnag.rb apps/openfoodnetwork/current/config/initializers/bugsnag.rb
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/config/initializers/log_before_timeout.rb apps/openfoodnetwork/current/config/initializers/log_before_timeout.rb
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/config/initializers/newrelic.rb apps/openfoodnetwork/current/config/initializers/newrelic.rb
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/shared/config/s3.yml apps/openfoodnetwork/current/config/s3.yml
    rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/log/logrotate.conf apps/openfoodnetwork/current/log/logrotate.conf
    
  • [ ] deploy master to new server (start new application)

Switch delayed jobs:

  • [ ] deactivate monit for old delayed job
  • [ ] stop old delayed job: RAILS_ENV=production ./script/delayed_job stop
  • [ ] start new delayed job: RAILS_ENV=production ./script/delayed_job start
  • [ ] monitor log file: tail -f ~/apps/openfoodnetwork/current/log/delayed_job.log

Switch application:

  • [ ] nginx proxy pass from old to new app
    # remove other `location /` rule and add
      location / {
        proxy_pass https://prod2.openfoodnetwork.org.au;
      }
    
  • [ ] clear cache of new application: run bundle exec rails c production, then Rails.cache.clear in the console
  • [ ] deactivate monit for old application
  • [ ] shut down the old application
  • [ ] disable startup of old application
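
After the proxy pass is in place, a quick request against the public domain shows whether the old nginx really hands traffic to the new application. A small sketch, assuming `curl` is available and that any 2xx or 3xx answer counts as healthy; `is_healthy_status` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed for HTTP status codes in the 2xx and 3xx
# range and fail for anything else (for example 502 from a broken proxy).
is_healthy_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (needs network access):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://openfoodnetwork.org.au/)
#   is_healthy_status "$status" || echo "unexpected status: $status"
```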

Switch cron jobs:

  • [ ] clear cron jobs on old server
  • [ ] install cron jobs on new server
  • [ ] update post-receive hook to install cronjobs

Expand Letsencrypt cert on new server:

  • [ ] Configure new nginx to listen to the production domain
  • [ ] Forward http traffic from old to new server
  • [ ] /opt/certbot/certbot-auto certonly -a webroot -w /home/openfoodnetwork/apps/openfoodnetwork/current/public/ -d prod2.openfoodnetwork.org.au -d openfoodnetwork.org.au --expand
  • [ ] Let nginx pass on the real domain name to test final setup
    # add Host header
      location / {
        proxy_pass https://prod2.openfoodnetwork.org.au;
        proxy_set_header Host $host;
      }
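
To confirm that the expanded certificate really covers both names, the subject alternative names can be read from the certificate the new server presents. A sketch, assuming `openssl` is installed; `has_san` is a hypothetical helper that parses the text output of `openssl x509`.

```shell
#!/bin/sh
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin
# and succeed if the given domain is listed as a DNS subject alternative name.
has_san() {
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "DNS:$1"
}

# Example (needs network access):
#   echo | openssl s_client -connect prod2.openfoodnetwork.org.au:443 \
#       -servername openfoodnetwork.org.au 2>/dev/null \
#     | openssl x509 -noout -text \
#     | has_san openfoodnetwork.org.au && echo "certificate covers the domain"
```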
    

Switch databases:

  • [ ] change database.yml
  • [ ] stop delayed job: (cd apps/openfoodnetwork/current/ && RAILS_ENV=production ./script/delayed_job stop)
  • [ ] place /home/openfoodnetwork/apps/openfoodnetwork/current/public/index.html (this should probably be an nginx error page)
    Sorry, you just caught us doing some maintenance.
    Please come back in five minutes.
    
  • [ ] stop new application: /etc/init.d/unicorn_openfoodnetwork stop
  • [ ] copy database (2 minutes)
    #!/bin/bash
    
    set -e
    
    pg_dump -h openfoodnetwork.org.au -U openfoodweb openfoodweb_production > /tmp/openfoodweb_production.sql
    dropdb -h localhost -U ofn_user openfoodnetwork
    createdb -h localhost -U ofn_user openfoodnetwork
    psql -h localhost -U ofn_user openfoodnetwork < /tmp/openfoodweb_production.sql
    
    echo "Done."
    
  • [ ] start new application
  • [ ] remove /home/openfoodnetwork/apps/openfoodnetwork/current/public/index.html
  • [ ] check old database for connections: echo "select * from pg_stat_activity;" | sudo -u postgres psql
  • [ ] start delayed job
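
Because the database switch is the one step that cannot be undone, a sanity check of the copy can be reassuring before traffic returns. A minimal sketch that compares row counts of a table on both servers. The helper names are hypothetical, the table name `spree_orders` is an assumption about the schema, and passwords are assumed to come from ~/.pgpass or PGPASSWORD.

```shell
#!/bin/sh
# Hypothetical helper: print the exact row count of one table.
# $1 = table name, all remaining arguments are passed to psql unchanged.
count_rows() {
  table="$1"; shift
  psql -At -c "select count(*) from $table" "$@"
}

# Hypothetical helper: compare two counts and report the result.
# $1 = table name, $2 = count on old server, $3 = count on new server.
check_table() {
  if [ "$2" = "$3" ]; then
    echo "OK $1 $2"
  else
    echo "MISMATCH $1 old=$2 new=$3"
    return 1
  fi
}

# Example (run on the new server while both applications are stopped):
#   old=$(count_rows spree_orders -h openfoodnetwork.org.au -U openfoodweb openfoodweb_production)
#   new=$(count_rows spree_orders -h localhost -U ofn_user openfoodnetwork)
#   check_table spree_orders "$old" "$new"
```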

Finishing the switch:

  • [ ] update post-receive hook on new server
  • [ ] update post-receive hook on old server
    # Tell other devs to deploy to the new server:
    echo "[ERROR] Aborting deploy!"
    echo "[Sat 3 Mar 2018] We have a new production server:"
    echo ""
    echo "  git remote set-url aus-production openfoodnetwork@prod2.openfoodnetwork.org.au:apps/openfoodnetwork/current"
    exit 1
    
  • [ ] update Buildkite to deploy to the new server: add its public key to the server, then test the connection with sudo -u buildkite-agent ssh openfoodnetwork@prod2.openfoodnetwork.org.au
  • [ ] install monit on new server
  • [ ] check that Letsencrypt renewals are set up as a cron job: 10 13 * * * /opt/certbot/certbot-auto renew --quiet --no-self-upgrade

After monitoring for a few days:

  • [ ] change DNS entry to new server

After monitoring for a few more days:

  • [ ] change DNS TTL back from 5 minutes to its usual value

Result

If you have any ideas how to do this a lot better with no extra work,
I will be deeply sad that I spent so much time on this and deeply happy that
it will be better next time. :wink:


#2

We just switched over the French production server. Since we don’t have monit any more and this server is not using the Git post-receive hook, we could omit a lot of steps. Config files, initializers and images had been synchronised beforehand. For the switch, we basically did these steps:

Expand Letsencrypt cert on new server:

  • [x] Configure new nginx to listen to production domain: server_name prod.openfoodfrance.org www.openfoodfrance.org openfoodfrance.org;
  • [x] Forward http traffic from old to new server
     # forward ACME challenge requests to the new server
     location '/.well-known/acme-challenge' {
       proxy_pass http://prod.openfoodfrance.org;
     }
    
  • [x] certbot certonly -d prod.openfoodfrance.org -d www.openfoodfrance.org -d openfoodfrance.org --expand

Stop the new server:

  • [x] stop delayed job: sudo systemctl stop delayed_job_openfoodnetwork.service
  • [x] stop new application: sudo systemctl stop unicorn_openfoodnetwork.service

Stop the old server:

  • [x] place /home/openfoodnetwork/apps/openfoodnetwork/current/public/index.html (this should probably be an nginx error page)
    Sorry, you just caught us doing some maintenance.
    Please come back in five minutes.
    
  • [x] stop delayed job: sudo systemctl stop delayed_job_openfoodnetwork.service
  • [x] stop old application: sudo systemctl stop unicorn_openfoodnetwork.service

Switch databases:

  • [x] copy database (2 minutes)
    #!/bin/bash
    
    set -e
    
    # The dump is gzip-compressed SQL, not a tar archive, hence the .sql.gz suffix.
    ssh offrance@openfoodfrance.org "pg_dump -h localhost -U ofn_user openfoodnetwork | gzip" > backup_20180809.sql.gz
    psql -h localhost -U ofn_user postgres -c "drop database openfoodnetwork"
    psql -h localhost -U ofn_user postgres -c "create database openfoodnetwork"
    zcat backup_20180809.sql.gz | psql -h localhost -U ofn_user openfoodnetwork
    
    echo "Done."
    
  • [x] start new application: sudo systemctl start unicorn_openfoodnetwork.service
  • [x] start delayed job: sudo systemctl start delayed_job_openfoodnetwork.service
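
One caveat with a dump piped through ssh and gzip: if pg_dump fails on the remote side, gzip can still exit successfully, so `set -e` does not stop the script before the database is dropped. A sketch of a guard that could run between the dump and the drop. `dump_is_complete` is a hypothetical helper; it relies on plain-format pg_dump output ending with a "PostgreSQL database dump complete" comment.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file is a valid gzip stream
# whose last lines contain pg_dump's completion marker.
dump_is_complete() {
  gzip -t "$1" 2>/dev/null || return 1
  gzip -dc "$1" | tail -n 5 | grep -q 'PostgreSQL database dump complete'
}

# Example (the file name is a placeholder):
#   dump_is_complete "$dump_file" || { echo "Dump incomplete, aborting."; exit 1; }
```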

Traffic switch:

  • [x] Forward all traffic to the new server:
     # add Host header
     location / {
       proxy_pass https://prod.openfoodfrance.org;
       proxy_set_header Host $host;
     }
    
  • [x] test the config and reload nginx: nginx -t && service nginx reload

I hope we can simplify this further and put some of it in ofn-install.