Proposal: Monitoring with ElasticStack

Hello! I’d like to propose switching our monitoring from Datadog (SaaS) to a self-hosted ElasticStack setup.

Datadog vs ElasticStack

We currently have limited monitoring on the “big three” instances (aka the Triumvirate) and no monitoring on the others, and this costs us around $180/month. These costs would increase if we added monitoring on additional servers (like US and Canada) or additional monitoring features, like improved log ingestion. The price-to-value ratio is pretty poor, and it gets worse at scale.

If we switch to a self-hosted ElasticStack setup we can have world-class monitoring on all servers, and the cost will be the price of a decent server (maybe $40/month). The costs would not increase above $40/month if we added any number of additional servers, or any additional monitoring features. I think we could easily get the equivalent monitoring value of a $2000/month Datadog bill out of that $40/month.
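To put rough numbers on that, here’s a back-of-the-envelope sketch using the figures quoted above. It assumes the Datadog bill grows roughly linearly with the number of monitored servers, which is my simplification and not their actual pricing model; the function names are just for illustration.

```python
# Rough monthly cost comparison, using the figures from this post.
# Assumption: Datadog cost scales linearly per monitored server
# (3 servers for about $180/month). Real pricing depends on plan and features.

DATADOG_MONTHLY_TOTAL = 180   # current bill, USD/month
DATADOG_SERVERS = 3           # the "big three" instances
SELF_HOSTED_MONTHLY = 40      # estimated price of one decent server, USD/month


def datadog_cost(servers: int) -> float:
    """Estimated Datadog bill if cost grows linearly with server count."""
    per_server = DATADOG_MONTHLY_TOTAL / DATADOG_SERVERS
    return per_server * servers


def self_hosted_cost(servers: int) -> float:
    """Flat cost: one monitoring server, however many servers it monitors."""
    return float(SELF_HOSTED_MONTHLY)


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(f"{n} servers: Datadog ~${datadog_cost(n):.0f}/month, "
              f"self-hosted ~${self_hosted_cost(n):.0f}/month")
```

Under that linear assumption the gap widens with every server we add, and that’s before counting extra features like log ingestion.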

Metabase

With the Metabase work moving forward there is a growing need for custom monitoring to be added to the Metabase server. If we do it with Datadog it will be expensive…

Alerts

I think Aus production currently has some custom alerting on metrics like disk space (via Wormly), which would be great to have on all servers. With ElasticStack we can set up any kind of alerts we want, like server uptime or alerts when the disk is dangerously full, on all servers. As OFN continues to grow I think we will increasingly need this kind of setup for managing an expanding infrastructure with a small team.
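To make the disk-space example concrete, here’s a rough sketch of the kind of check such an alert boils down to. This is plain Python for illustration, not ElasticStack configuration; the 90% threshold and the function names are made up.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_disk_dangerously_full(percent_used: float, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return percent_used >= threshold


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    if is_disk_dangerously_full(percent):
        print(f"ALERT: disk is {percent:.1f}% full")
    else:
        print(f"OK: disk is {percent:.1f}% full")
```

In a real setup the agents would ship the metric and the alert rule would live on the monitoring server, so the same threshold applies to every instance.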

Cost of setup

Setting up a nice Ansible repo for automatically provisioning and configuring the ElasticStack server and configuring the new monitoring agents takes quite a bit of work… which I’ve already done. I pushed the new (private) repo this morning for devs to look at and can do a live demo at some point. It’s been sat on my hard drive for a while because it required Rails 4, but now that we’re there I’ve retested it and it’s basically production-ready.


While investigating UK downtime just now, @Matt-Yorkley and I were talking about the lack of security monitoring within OFN.

I’d love to hear the security case for switching to ElasticStack :pray:

Yep, as a secondary benefit it would also enable various options for security monitoring across all our servers, which would be really nice.

I fully support this switch.
I look forward to having app logs in there as well! With that, and some work, I think this can replace Bugsnag. Or at least be integrated with Bugsnag alerts so we have a single source of monitoring data and alerts.

I think @sauloperez’s concern was: which Datadog features are we going to miss?

Are there any other concerns about this move?