I'm sharing here the notes and outcomes of the retro that @Rachel, @Matt-Yorkley and I held regarding last week's downtime in the UK.
Deltas (things we need to improve)
There was no aggregated view of the number of failed requests.
No way to know that some requests were taking 20s or more. Logging into the server and checking manually was a waste of time.
We need to be more aware of the impact that pages which don't require login can have on performance. The malicious requests were hitting v1's map, producers and groups pages. We need to be extra careful with performance problems on these pages.
There are lots of things we have no visibility into.
We have 0 security monitoring
We need to be more proactive about security threats and put some safeguards in place.
There's a feeling that we left Australia on their own. They also had downtime, though shorter and with fewer consequences, while we were focused on the UK. Could that cause tensions?
Similar issues could happen on newer servers, like what happened to the US, and we wouldn't notice.
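A quick way to have caught the 20s requests would have been to scan the Nginx access logs for request times, instead of checking manually. A rough sketch, assuming `$request_time` is appended to each log line as an `rt=` field (that field name is an assumption for illustration, not our actual log format):

```ruby
# Rough sketch: flag slow requests in Nginx access logs.
# Assumes a log_format that appends the request time as an "rt=" field, e.g.
#   log_format timed '$remote_addr "$request" $status rt=$request_time';
# That field name is an assumption, not what our servers run today.
SLOW_THRESHOLD = 20.0 # seconds

def slow_requests(lines, threshold = SLOW_THRESHOLD)
  lines.select do |line|
    line =~ /rt=(\d+(?:\.\d+)?)/ && Regexp.last_match(1).to_f >= threshold
  end
end

sample = [
  '1.2.3.4 "GET /map HTTP/1.1" 200 rt=0.12',
  '5.6.7.8 "GET /producers HTTP/1.1" 200 rt=23.48',
]
puts slow_requests(sample) # prints only the 23.48s request
```

Something this small could run from cron and post an alert, which is exactly the kind of visibility we were missing.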
Pluses (things that went well)
Nginx logs were invaluable to compare to metrics
We now have a new repo for security issues
We handled communication very well: we didn't tell everyone right from the beginning, since some people would have freaked out.
We might need to have a discussion about operations for the Australian server. They deserve to be treated like all the others, but is that something the AU team is handling on its own right now? If we decide to bring this into the global team as well, we'll need it to follow the same processes and tools. Using a separate set of tools makes things less efficient and more error-prone.
Improvements in the monitoring strategy
To improve our monitoring strategy we need to:
Enable the basic HappyApps alarms for all servers, no matter how many users they have
Enable Nginx logs for all instances monitored with Datadog by the core team. That's currently: France, UK, and Katuma
There's, however, an open question about production instances that have some traffic but are not the big ones: Belgium, Canada, and possibly others. There is no monitoring for them so far, and issues could happen. Their users also deserve the same level of service. We should bring them into Datadog as well.
To cover the extra costs this would represent, we propose charging a fixed monthly fee beyond their expected contributions to the global pot, so that OFN can confidently pay for third-party services like Datadog. They would have to bear these costs anyway if they handled operations at the local level.
Data about response times
By the same token, we discussed the fact that we need to have numbers around response times. That was what Skylight was for, but because our Rails version is not fully supported, its numbers are inconsistent, which makes it less useful. There are no numbers for the endpoints affected during the downtime.
I propose we pay for Datadog's Application Performance Monitoring (APM). Even with its high cost, we would save money. APM for a single host costs $31/month (see the pricing page), while 1h of my time is currently paid at $40, and the overall investigation of the downtime took me 3h 32min in total.
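To make the trade-off concrete, here is the same arithmetic spelled out (all figures are the ones quoted above):

```ruby
# Back-of-the-envelope check of the APM proposal, using the figures above.
apm_per_month = 31.0            # $ per host per month
hourly_rate   = 40.0            # $ per hour of sysadmin/dev time
minutes_spent = 3 * 60 + 32     # the 3h 32min downtime investigation

investigation_cost = minutes_spent / 60.0 * hourly_rate
months_covered     = investigation_cost / apm_per_month

puts format('one investigation cost: ~$%.2f', investigation_cost)
puts format('that pays for ~%.1f months of APM on one host', months_covered)
```

In other words, a single incident of this kind every four to five months is already enough for APM on one host to break even.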
Matt put in place some new firewall rules that should protect us from malicious attacks. He's now evaluating them on the UK's production server without enforcing them, so soon we'll be able to enable them on all instances.
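Matt's actual rules aren't shown here, but for context, request rate limiting at the Nginx level, which is one common way to blunt this kind of attack, looks roughly like the following. The zone name, rate, and burst values are purely illustrative:

```nginx
# Illustrative only; not Matt's actual rules.
http {
    # one shared zone keyed by client IP, capped at 5 requests/second
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        location / {
            # allow short bursts, reject everything beyond with a 429
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }
}
```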
It was an exciting and eventful week! We should have some new things in place soon that will allow us to make more #datadriven decisions on this front, as well as some greatly improved proactive security measures.
On the monitoring strategy side of things, I’m not sure about throwing more money at Datadog. Does their APM work with Rails 3?
…I have a proposal for a vastly improved monitoring system using a self-hosted ElasticStack server that would enable us to have full metrics, unlimited log collection, security monitoring, custom alerting, and (when we get to Rails 4) APM, across all instances, for the price of a server ($20?), instead of $50-$100+ per instance (if you include APM and full log collection). And we have ~10 instances now?
As soon as that above issue is closed/merged (the ElasticStack Postgres dashboard isn't fully implemented), I'll be strongly recommending we switch from Datadog. We would have all the features we have now, plus much more, on all servers, for a fraction of the price. Also, the data visualization and custom dashboard building are incredible.
P.S. I finished writing some Ansible playbooks/roles that would allow us to implement this setup about a month ago…
P.P.S. If we start using the Intrusion Detection System I just added in the ofn-security repo, we could ingest the logs from that as well with this monitoring setup, with custom alerting and high-level visibility of security issues. For all servers, at no additional cost. And it would be really easy…
You are not counting the cost of keeping the server up, running, managed, updated, backed up, etc. That's what you pay for with a SaaS like Datadog.
I tend to prefer paying for SaaS solutions because of the operational costs, but I think it's also acceptable to switch to a self-hosted solution for all the other reasons you mention above, mainly: we get a better solution!
Yes, there’s the cost of maintaining one server with an occasional update or turning-it-off-and-on-again.
Pau mentioned something yesterday: the cost of a sysadmin spending three hours rummaging around on the server grepping logfiles to figure out what's happening is actually really expensive!
We have around 14 servers in total (including staging), and that number is probably going to go up. I think we could get the equivalent value of a $2000/month Datadog bill by using a self-hosted setup with a $20-40/month server and $30-60/month of sysadmin time, and for every hour we spent maintaining it, we'd save 3+ hours by having awesome monitoring everywhere.
And for example if/when a new instance comes on board, there would be no additional cost for giving them all the metrics, logs, APM, alerts etc. So we wouldn't think "ah, that instance is a bit small, maybe we shouldn't waste $50-$100 per month on it". They could just get the whole (deluxe) package straight away, and we wouldn't lose anything.
And we could for example add monitoring on all of our staging servers (which might be really useful sometimes), at no additional cost. Because why not?
Zero Marginal Costs! That’s the exciting bit.
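The argument above in numbers (all figures are this thread's rough estimates, not real quotes):

```ruby
# The marginal-cost argument in numbers. All figures are rough estimates
# from this thread, not real quotes.
instances         = 14
per_instance_low  = 50    # $/month on Datadog, low estimate
per_instance_high = 100   # $/month on Datadog, high estimate
server_cost       = 40    # $/month for one self-hosted ElasticStack server
sysadmin_cost     = 60    # $/month of maintenance time, upper estimate

datadog_low  = instances * per_instance_low
datadog_high = instances * per_instance_high
self_hosted  = server_cost + sysadmin_cost

puts "Datadog: $#{datadog_low}-$#{datadog_high}/month, and it grows per instance"
puts "Self-hosted: ~$#{self_hosted}/month, flat: instance 15 adds $0"
```

The per-instance SaaS bill scales linearly with the fleet, while the self-hosted cost is essentially flat, which is the zero-marginal-cost point.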
Sorry for dumping all my thoughts out at once, I’ve thought about this quite a lot!
I bet it does, because Datadog has been around for a long time, and also the Ruby client has tests for our particular version.
Your suggestion is great, but until we get to Rails 4 and that issue is solved, there's no option other than sticking with Datadog a bit longer, I think. We can't afford to wait for ElasticStack to be ready before improving our monitoring.
Moving this forward
We still need to hear @maikel's and @Kirsten's thoughts on this and discuss it in our delivery train meeting. I'm happy to raise the topic in the meeting.
I like @Matt-Yorkley’s recommendation of going with self-hosted ElasticStack.
The growth of OFN involves getting more and more instances on board. Based on the numbers given by @Matt-Yorkley, it sounds like OFN has passed the critical point where a self-hosted, customizable solution becomes more cost-efficient. That's a good thing.
With regards to dev/sysadmin load, I imagine this would still consistently require hours, though. I agree that it will probably end up cheaper than using Datadog for all instances, but it will mean less of the current team's time going to software development for OFN.
If I understand you correctly, @sauloperez, you are suggesting that we set up HappyApps and Datadog for Aus. I have no problem with that. I didn't see the need because Wormly works quite well for us and has indefinite data retention. I also didn't realise that the Wormly account wasn't in Bitwarden before; I've added it now so that you can have a look. I haven't worked enough with Datadog to judge which is better.
Adding another service for monitoring sounds like a burden to me. We need to focus. But I also see that it could be cheaper to run. I also prefer to use free software instead of supporting commercial services. And we could use that infrastructure for other things in the future. Maybe there are other servers we want to monitor or we want to check if a farm database endpoint is online to update stock levels (just dreaming). But seriously, it could be one of the many OFN tools to support the local food distribution industry.
Summary: If we can easily set up an Ansible managed server that replaces Wormly, Datadog and HappyApps then I’m for it. If it also has an API so that we can set up the monitoring within ofn-install provisioning, I’m very happy.
If I understand you correctly, @sauloperez, you are suggesting that we set up HappyApps and Datadog for Aus.
Well, not exactly. As I pointed out under Australia operations above, it'd be fair to bring these responsibilities to the global team, as happens with all other instances. We think the AU team could have felt excluded when their instance also had downtime but the team focused on the UK only. Therefore, I want to:
Find out who is doing AU operations so far
Decide whether it is something we want to bring into the global team
If so, the AU team sticks to whatever we decide to use globally. We won't have a specific set of tools and processes only for AU. Of course, these can change if we decide to.
If not, we continue with whatever the AU team is doing now.
Is that clearer now, @maikel? Thoughts?
As for ElasticStack, I think we are all on the same page. It seems to be the right time if we want to have monitoring for all instances, but https://github.com/elastic/beats/issues/13103 is stopping us from doing so (pipe priorities aside, though). However, I think we also all agree that we have to improve our processes in the meantime.
So here is the list of issues I’ll create taking into account all the feedback:
Enable HappyApps alarms for all servers except AU for now
Enable Nginx logs monitoring for FR, UK and ES in Datadog
Enable Datadog’s APM for the UK to try it out.
Additionally, we'll have to keep an eye on that ElasticStack issue to see when we can migrate all servers to it.
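For the Nginx logs item, the per-host side of the setup is a small Datadog agent conf.d file, along these lines (the log paths assume default Debian locations, and `logs_enabled: true` must also be set in the agent's main `datadog.yaml`):

```yaml
# /etc/datadog-agent/conf.d/nginx.d/conf.yaml  (paths are assumptions)
logs:
  - type: file
    path: /var/log/nginx/access.log
    service: nginx
    source: nginx
  - type: file
    path: /var/log/nginx/error.log
    service: nginx
    source: nginx
```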
The Postgres dashboard issue has been merged in the ElasticStack repo
We're hopefully upgrading to Ruby 2.2.10 in the next week or so, which means we should now be able to use the ElasticStack APM for Rails, and that means we can have APM everywhere.
I've got an entire Ansible repo for this, and it's already finished. It can provision an ElasticStack server and set up the instances with the necessary metric and log shippers. I haven't pushed the repo yet. I can do a quick test of the APM when I get some time, but it should work once we are using Ruby 2.2.x.
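Since the repo isn't pushed yet, here's only a guess at the rough shape such a playbook could take; the host groups and role names below are invented for illustration:

```yaml
# Hypothetical playbook shape; the actual repo may look different.
- hosts: elasticstack        # the single self-hosted monitoring server
  become: true
  roles:
    - elasticsearch
    - kibana

- hosts: ofn_instances       # every production/staging app server
  become: true
  roles:
    - metricbeat             # system and Postgres metrics shipping
    - filebeat               # Nginx and Rails log shipping
```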
Edit: second update:
the APM is still not working, and it looks like the latest ElasticStack APM gems need Ruby 2.3.x, which may be out of reach for now. I can take another look, though.
If we suppose a $40/month server, and we can get metrics and unlimited logs on all servers, I think it’s still a good idea. We could drop down to Datadog APM on one host for $40/month and move all the other monitoring to ElasticStack and still save money and gain lots of value.
While I'm not opposed to your proposal, let me just share my concerns about it:
I'm afraid we might spend considerable time working on this, which becomes trickier given the current situation of the project: less dedication from everyone.
I'm scared of having to build everything from scratch and of having to become experts on the topic. I know there might be predefined dashboards and such, but I still feel it in my gut, also because we at Coopdevs are kind of experiencing that with Prometheus and Grafana. I feel like Datadog prevents that, and the understanding we have come to by now is pretty good.
It felt to me like a step in the opposite direction from having a single infrastructure (having to monitor N servers instead of one), but I also see this is not going to happen soon, and the data will be beneficial in the meantime.
I guess the important point is to be agile and review our decisions later on so we don’t end up stuck with something that we then see doesn’t move in the right direction.
Also, we should plan this and probably do it in small chunks. Who knows, maybe the most cost-effective solution is a hybrid one, as you somewhat suggest?