We set up new servers with ofn-install, and that's cool.
But we don't have any scripts to help with moving from one
server to another without downtime. This is my story of moving the Australian
production instance with three minutes of downtime. It has taken me 15 hours
so far, and there are still some outstanding pull requests and some
documentation to update.
First, I created a to-do list. I do not claim that it is complete.
I tried to create a process that allows steps to be undone where possible.
The big exception is the database switch.
Assuming that data is constantly changing, we can't just go back to the
previous database or a backup without losing the most recent data.
I did not set up master-master replication at the database level, due to lack
of experience and to limit the time I'm spending on this.
My checklist starts after I successfully installed openfoodnetwork via
ofn-install on a new server called prod2.openfoodnetwork.org.au.
I chose a new unique name which makes it possible to use the Ansible scripts
and Letsencrypt independently of the old server.
This is only the first time I have done all of this with an OFN server.
I am aware that I took some shortcuts and there is more work to do in
ofn-install to make this easier.
I hope we can refine the process and simplify it every time we have to repeat
it.
Checklist
Long term preparation:
- [ ] change DNS TTL to 5 minutes
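Before relying on the lower TTL, it can help to confirm what the DNS actually returns. The following is a minimal sketch, assuming `dig` is installed; the function names `ttl_from_answer` and `ttl_is_low` are made up for this example. It reads the TTL from one line of `dig +noall +answer` output, where the TTL is the second field.

```shell
#!/bin/sh
# Extract the TTL (second field) from one line of `dig +noall +answer` output.
ttl_from_answer() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Succeed if the TTL in the answer line is at most the given maximum
# (default: 300 seconds, i.e. the 5 minutes from the checklist).
ttl_is_low() {
    ttl=$(ttl_from_answer "$1")
    max=${2:-300}
    [ "$ttl" -le "$max" ]
}

# Usage (this queries DNS, so it is left as a comment here):
#   answer=$(dig +noall +answer openfoodnetwork.org.au A | head -n 1)
#   ttl_is_low "$answer" && echo "TTL is low enough"
```

A caching resolver reports the remaining TTL rather than the configured one, so querying the authoritative name server gives the more reliable number.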
Configure old server:
- [ ] set up firewall
      # set up ufw with all rules, then:
      ufw allow from 43.239.97.146 to any port postgresql
- [ ] allow postgresql to receive connections from the new server
Prepare new server:
- [ ] deactivate git post-receive hook for deployments
      # Tell other devs to not deploy to the old server:
      echo "[ERROR] Aborting deploy!"
      echo "[Thu 8 Feb 2018] Maikel is about to switch to the new production server."
      exit 1
- [ ] customise deploy script (post-receive hook)
- [ ] update post-receive hook to not install cronjobs for now (not execute whenever)
- [ ] add logrotate to deploy script (post-receive), see GH issue
      # Rotate log files
      logrotate -s "$CURRENT_PATH/log/logrotate-status" "$CURRENT_PATH/log/logrotate.conf"
- [ ] copy and adapt initializers and config
      #!/bin/sh
      rsync -avz --delete ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/public/system/ apps/openfoodnetwork/current/public/system/
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/public/assets/ofn_logo_black.png apps/openfoodnetwork/current/public/assets/ofn_logo_black.png
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/shared/config/bugsnag.rb apps/openfoodnetwork/current/config/initializers/bugsnag.rb
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/config/initializers/log_before_timeout.rb apps/openfoodnetwork/current/config/initializers/log_before_timeout.rb
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/config/initializers/newrelic.rb apps/openfoodnetwork/current/config/initializers/newrelic.rb
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/shared/config/s3.yml apps/openfoodnetwork/current/config/s3.yml
      rsync -avz ubuntu@openfoodnetwork.org.au:apps/openfoodweb/current/log/logrotate.conf apps/openfoodnetwork/current/log/logrotate.conf
- [ ] deploy master to new server (start new application)
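Instead of editing the post-receive hook by hand every time deployments need to be blocked, the abort message could be driven by a marker file. This is only a sketch of that idea, not what I did: the `DEPLOY_LOCK` path and the `check_deploy_lock` function are hypothetical names and are not part of ofn-install.

```shell
#!/bin/sh
# Hypothetical marker file: while it exists, deployments are refused.
# Its content is shown to whoever tries to deploy.
DEPLOY_LOCK=${DEPLOY_LOCK:-"$HOME/DEPLOY_LOCKED"}

# Print the abort message and return 1 if the marker file exists,
# otherwise return 0 so that the hook can carry on.
check_deploy_lock() {
    if [ -f "$DEPLOY_LOCK" ]; then
        echo "[ERROR] Aborting deploy!"
        cat "$DEPLOY_LOCK"
        return 1
    fi
    return 0
}

# At the top of the post-receive hook:
#   check_deploy_lock || exit 1
```

Blocking deployments would then be a matter of writing the explanation into the marker file, and removing the file would allow deployments again.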
Switch delayed jobs:
- [ ] deactivate monit for old delayed job
- [ ] stop old delayed job:
RAILS_ENV=production ./script/delayed_job stop
- [ ] start new delayed job:
RAILS_ENV=production ./script/delayed_job start
- [ ] monitor log file:
tail -f ~/apps/openfoodnetwork/current/log/delayed_job.log
Switch application:
- [ ] nginx proxy pass from old to new app
      # remove other `location /` rule and add
      location / {
        proxy_pass https://prod2.openfoodnetwork.org.au;
      }
- [ ] clear cache of new application:
bundle exec rails c production
and Rails.cache.clear
- [ ] deactivate monit for old application
- [ ] shut down the old application
- [ ] disable startup of old application
Switch cron jobs:
- [ ] clear cron jobs on old server
- [ ] install cron jobs on new server
- [ ] update post-receive hook to install cronjobs
Expand Letsencrypt cert on new server:
- [ ] Configure new nginx to listen for the production domain
- [ ] Forward http traffic from old to new server
- [ ] /opt/certbot/certbot-auto certonly -a webroot -w /home/openfoodnetwork/apps/openfoodnetwork/current/public/ -d prod2.openfoodnetwork.org.au -d openfoodnetwork.org.au --expand
- [ ] Let nginx pass on the real domain name to test final setup
      # add Host header
      location / {
        proxy_pass https://prod2.openfoodnetwork.org.au;
        proxy_set_header Host $host;
      }
Switch databases:
- [ ] change database.yml
- [ ] stop delayed job: (cd apps/openfoodnetwork/current/ && RAILS_ENV=production ./script/delayed_job stop)
- [ ] place /home/openfoodnetwork/apps/openfoodnetwork/current/public/index.html (this should probably be an nginx error page)
Sorry, you just caught us doing some maintenance. Please come back in five minutes.
- [ ] stop new application: /etc/init.d/unicorn_openfoodnetwork stop
- [ ] copy database (2 minutes)
      #!/bin/bash
      set -e
      pg_dump -h openfoodnetwork.org.au -U openfoodweb openfoodweb_production > /tmp/openfoodweb_production.sql
      dropdb -h localhost -U ofn_user openfoodnetwork
      createdb -h localhost -U ofn_user openfoodnetwork
      psql -h localhost -U ofn_user openfoodnetwork < /tmp/openfoodweb_production.sql
      echo "Done."
- [ ] start new application
- [ ] remove /home/openfoodnetwork/apps/openfoodnetwork/current/public/index.html
- [ ] check old database for connections: echo "select * from pg_stat_activity;" | sudo -u postgres psql
- [ ] start delayed job
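Since the database switch is the one step that cannot be undone, it may be worth rehearsing the copy before the real run. Below is a minimal sketch of the same four commands wrapped in a dry-run helper. The `DRY_RUN` variable and the `run` and `copy_database` functions are assumptions of this sketch. It also differs from my script in two ways: it uses the `-f` options of `pg_dump` and `psql` instead of shell redirection, and it sets `ON_ERROR_STOP` so that a failed restore stops at the first error.

```shell
#!/bin/sh
# Host, user and database names are the ones from the checklist.
DUMP_FILE=${DUMP_FILE:-/tmp/openfoodweb_production.sql}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Dump the old database and load it into a freshly created new one.
# The && chain stops at the first failing step.
copy_database() {
    run pg_dump -h openfoodnetwork.org.au -U openfoodweb -f "$DUMP_FILE" openfoodweb_production &&
    run dropdb -h localhost -U ofn_user openfoodnetwork &&
    run createdb -h localhost -U ofn_user openfoodnetwork &&
    run psql -h localhost -U ofn_user -v ON_ERROR_STOP=1 -f "$DUMP_FILE" openfoodnetwork
}

# Rehearsal:  DRY_RUN=1; copy_database
# Real run:   DRY_RUN=0; copy_database && echo "Done."
```

In dry-run mode nothing touches either database; the four commands are only printed in the order they would run.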
Finishing the switch:
- [ ] update post-receive hook on new server
- [ ] update post-receive hook on old server
      # Tell other devs to deploy to the new server:
      echo "[ERROR] Aborting deploy!"
      echo "[Sat 3 Mar 2018] We have a new production server:"
      echo ""
      echo "  git remote set-url aus-production openfoodnetwork@prod2.openfoodnetwork.org.au:apps/openfoodnetwork/current"
      exit 1
- [ ] update Buildkite to deploy to new server, add public key to server,
sudo -u buildkite-agent ssh openfoodnetwork@prod2.openfoodnetwork.org.au
- [ ] install monit on new server
- [ ] Check Letsencrypt renewals.
10 13 * * * /opt/certbot/certbot-auto renew --quiet --no-self-upgrade
After monitoring for a few days:
- [ ] change DNS entry to new server
After monitoring for a few more days:
- [ ] change DNS TTL to 5 minutes
Result
If you have any ideas how to do this a lot better with no extra work,
I will be deeply sad that I spent so much time on this and deeply happy that
it will be better next time.