UK's downtime retro

Here are the notes and outcomes of the retro that @Rachel, @Matt-Yorkley and I held about the downtime the UK instance experienced last week.

Deltas (things we need to improve)

  • There was no aggregated view of the number of failed requests.
  • There was no way to know that some requests were taking 20s or more. Logging into the server and checking manually was a waste of time.
  • We need to be more aware of the impact that pages which don’t require login can have on performance. The malicious requests were using v1’s map, producers and groups pages, so we need to be extra careful with performance problems in these pages.
  • There are lots of things we have no visibility into.
  • We have zero security monitoring.
  • We need to be more proactive about security threats and put some safeguards in place.
  • There’s the feeling that we left Australia on their own. They also had downtime, shorter and with fewer consequences, while we were focused on the UK. Could that cause tensions?
  • Similar issues could happen on newer servers, as happened to the US, and we wouldn’t notice.
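
The first two deltas (no aggregated view of failed requests, no way to spot requests taking 20s or more) could be covered, at least as a stopgap, by a small script over the Nginx access logs. This is only a sketch: it assumes the log format is Nginx's `combined` format with `$request_time` appended as the last field, which is not the default and would have to be configured. The function name `summarize` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format with $request_time appended at the end.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
    r'.* (?P<request_time>\d+\.\d+)$'
)


def summarize(lines, slow_threshold=20.0):
    """Count 5xx responses and slow requests per path (query string dropped)."""
    failed = Counter()
    slow = Counter()
    total = 0
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that don't follow the assumed format
        total += 1
        path = match["path"].split("?", 1)[0]
        if match["status"].startswith("5"):
            failed[path] += 1
        if float(match["request_time"]) >= slow_threshold:
            slow[path] += 1
    return {"total": total, "failed": failed, "slow": slow}


# Invented sample lines, only to show the expected shape of the input.
sample = [
    '1.2.3.4 - - [10/Sep/2019:10:00:00 +0000] "GET /map HTTP/1.1" 200 512 "-" "curl/7.58" 21.304',
    '1.2.3.4 - - [10/Sep/2019:10:00:01 +0000] "GET /producers?page=2 HTTP/1.1" 502 0 "-" "curl/7.58" 0.010',
    '1.2.3.4 - - [10/Sep/2019:10:00:02 +0000] "GET /groups HTTP/1.1" 200 100 "-" "Mozilla/5.0" 0.120',
]
report = summarize(sample)
print(report["total"], dict(report["failed"]), dict(report["slow"]))
```

On a real server the same function could be fed an open log file instead of `sample`. It is no replacement for proper monitoring, but it gives the per-page counts we were missing during the incident.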

Pluses (things that went well)

  • Nginx logs were invaluable to compare to metrics
  • We now have a new repo for security issues
  • We handled communication very well: we didn’t tell anyone at the very beginning, since some people would have freaked out.

Outcomes

Australian operations

We might need to have a discussion about operations for the Australian server. That instance deserves to be treated like all the others, but is operations something the AU team is handling now? If we decide to bring this into the global team as well, it will need to follow the same processes and tools. Using a separate set of tools makes things less efficient and more error-prone.

Improvements in the monitoring strategy

To improve our monitoring strategy we need to:

  • Enable the basic HappyApps alarms for all servers, no matter how many users they have
  • Enable Nginx logs for all instances monitored with Datadog by the core team. That’s currently: France, UK, and Katuma

There is, however, an open question about production instances that have some traffic but are not the big ones: Belgium, Canada, and possibly others. They have no monitoring so far, so issues could go unnoticed. Their users also deserve to receive the same level of service. We should bring them into Datadog as well.

To cover the extra costs this would represent, we propose charging a fixed monthly fee on top of their expected contributions to the global pot, so that OFN can confidently pay for third-party services like Datadog. They would have to pay these costs anyway if they had to deal with operations at the local level.

Data about response times

By the same token, we discussed the fact that we need numbers on response times. That was what Skylight was for, but because our Rails version is not fully supported, its numbers are not consistent, which makes it less useful. There are no numbers for the endpoints affected during the downtime.

I propose we pay for Datadog’s Application Performance Monitoring (APM). Even with its high cost, we would save money. APM for a single host costs $31/month (see the pricing page), while 1h of my time is currently paid at 40 and the investigation of the downtime took me 3h 32min in total.
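
To make that comparison concrete, here is the arithmetic as a quick sketch. It only uses the figures quoted above and assumes, for the sake of comparison, that the hourly rate of 40 is in the same currency as the APM price, which the figures themselves don't confirm.

```python
APM_COST_PER_HOST_PER_MONTH = 31.0  # USD, figure quoted above
HOURLY_RATE = 40.0                  # figure quoted above; same currency is an assumption
INVESTIGATION_HOURS = 3 + 32 / 60   # 3h 32min

# Cost of the manual investigation of this single incident.
investigation_cost = HOURLY_RATE * INVESTIGATION_HOURS

# How many months of single-host APM that amount would have paid for.
months_of_apm_covered = investigation_cost / APM_COST_PER_HOST_PER_MONTH

print(f"Investigation cost: {investigation_cost:.2f}")
print(f"Months of APM for one host covered: {months_of_apm_covered:.1f}")
```

In other words, one incident of this size costs roughly as much as four and a half months of APM for a single host, provided APM would actually have shortened the investigation.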

Firewall

Matt put in place some new firewall rules that should protect us from malicious attacks. He’s now evaluating them in the UK’s production without applying them, so we’ll soon be able to enable them on all instances.

This should protect us from future attacks.

@Rachel @Matt-Yorkley feel free to amend anything.

It was an exciting and eventful week! We should have some new things in place soon that will allow us to make more #datadriven decisions on this front, as well as some greatly improved proactive security measures.

On the monitoring strategy side of things, I’m not sure about throwing more money at Datadog. Does their APM work with Rails 3?

Pending this issue being resolved: https://github.com/elastic/beats/issues/13103

…I have a proposal for a vastly improved monitoring system using a self-hosted ElasticStack server that would enable us to have full metrics, unlimited log collection, security monitoring, custom alerting, and (when we get to Rails 4) APM, across all instances, for the price of a server ($20?), instead of $50-$100+ per instance (if you include APM and full log collection). And we have ~10 instances now?

As soon as the issue above is closed/merged (the ElasticStack Postgres dashboard isn’t fully implemented), I’ll be strongly recommending we switch from Datadog. We would have all the features we have now, plus much more, on all servers, for a fraction of the price. Also, the data visualization and custom dashboard building is incredible.

https://www.elastic.co/products/kibana

P.S. I finished writing some Ansible playbooks/roles that would allow us to implement this setup about a month ago… :heart:

P.P.S. If we start using the Intrusion Detection System I just added in the ofn-security repo, we could ingest the logs from that as well with this monitoring setup, with custom alerting and high-level visibility of security issues. For all servers, at no additional cost. And it would be really easy…

You are not counting the cost of keeping the server up, running, managed, updated, backed up, etc. That’s what you pay a SaaS like Datadog for.
I tend to prefer paying for SaaS solutions because of the costs of operation, but I think it’s also acceptable to switch to a self-hosted solution for all the other reasons you mention above, mainly: we get a better solution!

Yes, there’s the cost of maintaining one server with an occasional update or turning-it-off-and-on-again.

Pau mentioned something yesterday: the cost of a sysadmin spending three hours rummaging around on the server, grepping logfiles to figure out what’s happening, is actually really expensive!

We have around 14 servers in total (including staging), and that number is probably going to go up. I think we could get the equivalent value of a $2000/month Datadog bill by using a self-hosted setup with a $20-40/month server and $30-60/month of sysadmin time, and for every hour we spend maintaining it, we’d save 3+ hours by having awesome monitoring everywhere.
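
As a rough sketch of that comparison, using only the estimates in the paragraph above (they are guesses, not quotes from any provider, and the helper name `monthly_cost_range` is just for illustration):

```python
SERVERS = 14                 # approximate count, including staging
DATADOG_EQUIVALENT = 2000    # USD/month, estimated value of the equivalent Datadog bill


def monthly_cost_range(server, sysadmin):
    """Add two (low, high) monthly estimates into one (low, high) range."""
    return (server[0] + sysadmin[0], server[1] + sysadmin[1])


# $20-40/month for the server plus $30-60/month of sysadmin time.
self_hosted = monthly_cost_range((20, 40), (30, 60))

# Savings against the estimated Datadog-equivalent bill, as (low, high).
savings = (DATADOG_EQUIVALENT - self_hosted[1], DATADOG_EQUIVALENT - self_hosted[0])

# Self-hosted cost spread across all servers.
per_server = tuple(round(cost / SERVERS, 2) for cost in self_hosted)

print(f"Self-hosted: ${self_hosted[0]}-${self_hosted[1]} per month")
print(f"Savings: ${savings[0]}-${savings[1]} per month")
print(f"Per server: ${per_server[0]}-${per_server[1]} per month")
```

The per-server figure shrinks as more instances come on board, since the total stays roughly fixed.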

And for example, if/when a new instance comes on board, there would be no additional cost for giving them all the metrics, logs, APM, alerts etc. So we wouldn’t think “ah, that instance is a bit small, maybe we shouldn’t waste $50-$100 per month on it”. They could just get the whole (deluxe) package straight away, and we wouldn’t lose anything.

And we could for example add monitoring on all of our staging servers (which might be really useful sometimes), at no additional cost. Because why not?

Zero Marginal Costs! That’s the exciting bit.

Sorry for dumping all my thoughts out at once, I’ve thought about this quite a lot!

Sure, we’re all on the same page. It’s all about

Zero Marginal Costs! That’s the exciting bit.

So to answer you @Matt-Yorkley

Does their APM work with Rails 3?

I bet it does, because Datadog has been around for a long time, and the Ruby client also has tests for our particular version.

Your suggestion is great, but until we get to Rails 4 and that issue is solved, I think there’s no option other than sticking with Datadog a bit longer. We can’t afford to wait for ElasticStack to be ready before improving our monitoring.

Moving this forward

We still need to hear @maikel’s and @Kirsten’s thoughts on this and discuss it in our delivery train meeting. I’m happy to raise the topic in the meeting.

I like @Matt-Yorkley’s recommendation of going with self-hosted ElasticStack. :slight_smile:

The growth of OFN involves getting more and more instances on-board. Based on the numbers given by @Matt-Yorkley, it sounds like OFN has passed the critical point where it is now more cost-efficient to go for a self-hosted and customizable solution. :slightly_smiling_face: :tada: It is a good thing.

With regard to dev/sysadmin load, I imagine this would require hours on an ongoing basis, though. I agree that it will probably end up cheaper than using Datadog for all instances, but it will mean less of the current team’s time going to software development for OFN.

Great summary, @sauloperez. Thank you.

I very much agree that all instances even the new/smaller ones should have the same security and monitoring systems in place.

If I understand you correctly, @sauloperez, you are suggesting that we set up HappyApps and Datadog for Aus. I have no problem with that. I didn’t see the need because Wormly works quite well for us and has indefinite data retention. And I didn’t realise that the Wormly account wasn’t in Bitwarden before. I’ve added it now so that you can have a look. I haven’t worked enough with Datadog to make a call on which is better.

Adding another service for monitoring sounds like a burden to me. We need to focus. But I also see that it could be cheaper to run. I also prefer to use free software instead of supporting commercial services. And we could use that infrastructure for other things in the future. Maybe there are other servers we want to monitor or we want to check if a farm database endpoint is online to update stock levels (just dreaming). But seriously, it could be one of the many OFN tools to support the local food distribution industry.

Summary: If we can easily set up an Ansible-managed server that replaces Wormly, Datadog and HappyApps, then I’m for it. If it also has an API so that we can set up the monitoring within ofn-install provisioning, I’m very happy. :slight_smile:

If I understand you correctly, @sauloperez, you are suggesting that we set up HappyApps and Datadog for Aus.

Well, not exactly. As I pointed out in “Australian operations” above, it’d be fair to bring these responsibilities into the global team, as happens with all other instances. We think that the AU team could have felt excluded when your instance also had downtime but the team focused on the UK only. Therefore, I want to:

  • Know who is doing AU operations so far
  • if it is something that we want to bring into the global team
    • then AU team sticks to what we decide we want to use globally. We won’t have a specific set of tools and processes only for AU. Of course, these can change if we decide to.
  • else
    • we continue doing what the AU team might be doing now.

Is that clearer now, @maikel? Thoughts?

As for ElasticStack, I think we are all on the same page. It seems to be the right time if we want to have monitoring for all instances, but https://github.com/elastic/beats/issues/13103 is stopping us from doing so (pipe priorities aside). However, I think we also all agree that we have to improve our processes in the meantime.

So here is the list of issues I’ll create taking into account all the feedback:

  • Enable HappyApps alarms for all servers except AU for now
  • Enable Nginx logs monitoring for FR, UK and ES in Datadog
  • Enable Datadog’s APM for the UK to try it out.

Additionally, we’ll have to keep an eye on that ElasticStack issue to see when we can migrate all servers to it.