Deltas (things we need to improve)
- There was no aggregated view of the number of failed requests.
- There was no way to know that some requests were taking 20s or more. Logging into the server and checking manually was a waste of time.
- We need to be more aware of the impact that pages not requiring login can have on performance. The malicious requests were hitting v1’s map, producers and groups pages, so we need to be extra careful with performance problems on these pages.
- There are many parts of the system we have no visibility into.
- We have zero security monitoring.
- We need to be more proactive about security threats and put some concrete measures in place.
- There’s a feeling that we left Australia on their own while we were focused on the UK. Their downtime was shorter and had fewer consequences, but could that cause tension?
- Similar issues could happen on newer servers, as happened to the US one, and we wouldn’t notice.
Pluses (things that went well)
- Nginx logs were invaluable to compare against our metrics.
- We now have a new repo for security issues
- We handled communication very well: we didn’t tell everyone from the very beginning, since some people would have freaked out.
We might need to have a discussion about operations for the Australian server. They deserve to be treated like all the others, but is that something the AU team is handling now? If we decide to bring this into the global team as well, we’ll need it to follow the same processes and tools; using a separate set of tools makes things less efficient and more error-prone.
Improvements in the monitoring strategy
To improve our monitoring strategy we need to:
- Enable the basic HappyApps alarms for all servers, no matter how many users they have
- Enable Nginx logs for all instances monitored with Datadog by the core team. That’s currently France, UK, and Katuma (see the configuration sketch after this list).
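
As a rough illustration of what enabling Nginx log collection involves, here is a minimal sketch of the Datadog Agent configuration for one instance. It assumes the Agent (v6+) is already installed and that `logs_enabled: true` is set in `/etc/datadog-agent/datadog.yaml`; the service name and log paths are placeholders and would need to match each server’s setup.

```yaml
# /etc/datadog-agent/conf.d/nginx.d/conf.yaml
# Minimal sketch: ship one instance's Nginx access and error logs to Datadog.
logs:
  - type: file
    path: /var/log/nginx/access.log
    source: nginx
    service: ofn-uk   # placeholder; use the instance's own name
  - type: file
    path: /var/log/nginx/error.log
    source: nginx
    service: ofn-uk
```

On a systemd-based server, restarting the datadog-agent service picks up the new log configuration.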
There’s, however, an open question about production instances that have some traffic but are not the big ones: Belgium, Canada, and possibly others. There is no monitoring for them so far and issues could happen. Their users also deserve the same level of service, so we should bring them into Datadog as well.
To cover the extra costs this would represent, we propose charging a fixed monthly fee on top of their expected contributions to the global pot, so that OFN can confidently pay for third-party services like Datadog. They would have to bear these costs anyway if they dealt with operations at the local level.
Data about response times
By the same token, we discussed the fact that we need numbers on response times. That is what Skylight was for, but because our Rails version is not fully supported, its numbers are inconsistent, which makes it less useful. There are no numbers for the endpoints affected during the downtime.
I propose we pay for Datadog’s Application Performance Monitoring (APM). Even with its high cost, we would save money. APM for a single host costs $31/month (see the pricing page), while 1h of my time is currently paid at 40 and the overall investigation of the downtime took me 3h 32min in total, which adds up to roughly 141, more than four months of APM for one host.
Matt has put in place some new firewall rules that should protect us from malicious attacks. He’s now evaluating them on the UK production server without enforcing them yet; once they are validated, we’ll enable them on all instances. This should protect us from similar attacks in the future.
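
The exact rules Matt deployed are not reproduced here. Purely as an illustration of the “evaluate before enforcing” approach, this is what a rate limit on the abused public pages could look like using Nginx’s limit_req in dry-run mode; the paths, rate and zone name are assumptions, not the actual rules.

```nginx
# Illustrative only, not the actual rules. A rate limit on the public v1 pages
# that were abused (map, producers, groups), evaluated before being enforced.
# The limit_req_zone directive belongs in the http context of nginx.conf.
limit_req_zone $binary_remote_addr zone=public_pages:10m rate=5r/s;

server {
    # ... rest of the existing server block ...

    location ~ ^/(map|producers|groups) {
        limit_req zone=public_pages burst=10 nodelay;

        # Dry-run mode (Nginx 1.17.1+): violations are logged but requests still pass.
        # Remove this line (or set it to off) to start enforcing the limit.
        limit_req_dry_run on;
    }
}
```

In dry-run mode the would-be rejections show up in the error log, so the impact can be checked against real traffic before enforcement is turned on, which mirrors the evaluation phase described above.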