We recently had some instability and response-time issues with the Australian production server that serves openfoodnetwork.org.au as well as a number of WordPress sites such as openfoodnetwork.org. Here’s a quick write-up of the symptoms we saw and how we fixed the problem.
We receive SMS alerts from Wormly about any downtime on those two sites, and I was getting reports of several ~5 minute outages every day. Additionally, New Relic showed the website responding roughly 25% more slowly than normal.
SSHing into the server during one of the reported downtimes showed that the unicorn workers and the delayed_job worker were running as usual, along with a db2fog:clean rake task.
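For context, a minimal sketch of how you might confirm what’s running during an incident like this, assuming a standard Rails stack with unicorn and delayed_job (the exact process names will vary by deployment):

    # list the app-related processes
    ps aux | grep -E 'unicorn|delayed_job|rake'
    # or sort everything by resident memory to spot the heavy hitters
    ps aux --sort=-rss | head -n 15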
Looking at the memory usage, we could see the server was using all of its RAM and a significant portion of its swap space. So at this point it looked like the server was starved for RAM, and when the scheduled db2fog:clean task started up, it bumped the server over the edge and it was no longer able to respond to requests in a reliable or timely manner.
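As a rough sketch, these are the kind of commands you’d reach for to check this (standard Linux tools, nothing specific to our setup):

    # overall RAM and swap usage, in megabytes
    free -m
    # watch swap-in/swap-out activity (the si/so columns) every 5 seconds;
    # sustained non-zero values suggest the box is thrashing
    vmstat 5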
Because the system had ample swap space, at no point did we see OOM (out of memory) errors with processes dying outright. It seemed that the server could hold itself together using swap, but when pushed (as it was by the scheduled rake task), it wasn’t able to respond to requests in a reasonable amount of time, and we would see 504 Gateway Timeout errors reported by nginx.
Our server was running on Amazon EC2 on a c1.medium instance, which had only 1.7 GB of RAM. Since we created that server, Amazon has released the c3.large instance type, which has more than double the memory (3.75 GB), is slightly faster, and is also cheaper. Out of hours, we switched the instance type over[1] and the server has been stable (and much faster) ever since.
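For reference, the resize itself is just a stop/modify/start cycle, sketched below with the AWS CLI (the instance ID is a placeholder, and this assumes an EBS-backed instance so it survives being stopped):

    # stop the instance, change its type, then start it again
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
        --instance-type "{\"Value\": \"c3.large\"}"
    aws ec2 start-instances --instance-ids i-0123456789abcdef0

Whether you do it via the console or the CLI, expect a few minutes of downtime while the instance is stopped, which is why we did it out of hours.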
[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-resize.html