Proposal: Monitoring with ElasticStack

Hello! I’d like to make a proposal to switch our monitoring from Datadog (SaaS) to a self-hosted ElasticStack setup.

Datadog vs ElasticStack

We currently have limited monitoring on the “big three” instances (aka the Triumvirate) and no monitoring on the others, and this costs us around $180/month. These costs would increase if we added monitoring on additional servers (like US and Canada) or additional monitoring features, like improved log ingestion. The price-to-value ratio is pretty poor, and it gets worse at scale.

If we switch to a self-hosted ElasticStack setup we can have world-class monitoring on all servers, and the cost will be the price of a decent server (maybe $40/month). The costs would not increase above $40/month if we added any number of additional servers, or any additional monitoring features. I think we could easily get the equivalent monitoring value of a $2000/month Datadog bill out of that $40/month.
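As a rough illustration of how the two cost curves diverge, here is a back-of-the-envelope sketch. It assumes the Datadog bill scales linearly per server at about $60 each (our current $180/month spread over three instances), which is a simplification and not their actual price list, and that the ElasticStack server stays flat at an estimated $40/month. The function names and the per-server figure are made up for illustration.

```python
# Back-of-the-envelope comparison. The per-server figure is an ASSUMPTION
# derived from our current bill ($180/month for 3 servers), not Datadog's pricing.
DATADOG_PER_SERVER = 180 / 3  # assumed: cost scales linearly with server count
ELASTIC_FLAT = 40             # estimated monthly price of one decent server


def datadog_monthly_cost(servers: int) -> float:
    """Estimated Datadog bill if every server costs the same as today."""
    return DATADOG_PER_SERVER * servers


def elastic_monthly_cost(servers: int) -> float:
    """Self-hosted cost: one monitoring server, however many hosts report to it."""
    return ELASTIC_FLAT


for n in (3, 5, 10):
    print(
        f"{n} servers: Datadog ~${datadog_monthly_cost(n):.0f}/month, "
        f"ElasticStack ~${elastic_monthly_cost(n):.0f}/month"
    )
```

This ignores maintenance time on the self-hosted side and any extra Datadog features, so treat it as a sketch of the hosting bill only.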

Metabase

With the Metabase work moving forward there is a growing need for custom monitoring to be added to the Metabase server. If we do it with Datadog it will be expensive…

Alerts

I think Aus production currently has some custom alerting on metrics like disk space (via Wormly), which would be great to have on all servers. With ElasticStack we can set up any kind of alerts we want, like server uptime or alerts when the disk is dangerously full, on all servers. As OFN continues to grow I think we will increasingly need this kind of setup for managing an expanding infrastructure with a small team.
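To make the "disk is dangerously full" idea concrete, here is a minimal sketch of the kind of check such an alert boils down to. It is not how ElasticStack implements alerting; in the real setup the monitoring agents and alert rules would do this for us. The 90% threshold and the function names are assumptions for illustration.

```python
import shutil

# Hypothetical threshold: treat the disk as "dangerously full" at 90% used.
DISK_ALERT_THRESHOLD_PERCENT = 90.0


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def disk_is_dangerously_full(
    total_bytes: int,
    used_bytes: int,
    threshold: float = DISK_ALERT_THRESHOLD_PERCENT,
) -> bool:
    """True when usage is at or above the alert threshold."""
    return percent_used(total_bytes, used_bytes) >= threshold


if __name__ == "__main__":
    # shutil.disk_usage returns a (total, used, free) named tuple in bytes.
    usage = shutil.disk_usage("/")
    pct = percent_used(usage.total, usage.used)
    status = "ALERT" if disk_is_dangerously_full(usage.total, usage.used) else "ok"
    print(f"/ is {pct:.1f}% full: {status}")
```

The point of a central setup is that a rule like this gets defined once and applied to every server, instead of being configured per instance.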

Cost of setup

Setting up a nice Ansible repo for automatically provisioning and configuring the ElasticStack server and configuring the new monitoring agents takes quite a bit of work… which I’ve already done. I pushed the new (private) repo this morning for devs to look at and can do a live demo at some point. It’s been sat on my hard drive for a while because it required Rails 4, but now that we’re there I’ve retested it and it’s basically production-ready.

While investigating UK downtime just now, @Matt-Yorkley and I were talking about the lack of security monitoring within OFN.

I’d love to hear the security case for switching to ElasticStack :pray:

Yep, as a secondary benefit it would also enable various options for security monitoring across all our servers, which would be really nice.

I fully support this switch.
I look forward to having app logs in there as well! With that, and some work, I think this can replace Bugsnag. Or at least be integrated with Bugsnag alerts so we have a single source of monitoring data and alerts.

I think @sauloperez’s concern was: which Datadog features are we going to miss?

Are there any other concerns about this move?

There are a few questions floating around at the moment, like:

  • Can we continue to roll out our data/analytics setup? What impact would it have on smaller instances?
  • Can we use the new security setup on all servers? What impact would it have?
  • How urgently do we need to upgrade the US and Canadian servers? Are they okay or are they on fire?

The current answer to all of these questions is: “we have no idea, and no way of finding out”.

If it’s less expensive and more powerful than Datadog, no instance manager will disapprove of this switch.

That means it is up to core devs to make a decision here, as it means maintaining a new server? So I guess this needs input from @maikel @apb @sauloperez?

It’s cheaper in hosting costs but probably more effort in maintenance. Using open source is more aligned with our values, though.

I’m not sure we can outright replace Bugsnag, but we can definitely replace Datadog, HappyApps, and Wormly for all servers.

There’s no doubt that we need to expand our monitoring to other instances, and at scale a SaaS such as Datadog is going to cost quite a lot of money. However, we tend to greatly underestimate the time it takes to maintain a self-hosted solution. That’s another server that needs maintenance, upgrading, etc., but also a new tool that needs mastering, and in my experience anything related to observability is not an easy task.

That being said, I do see it as an investment that OFN needs, knowing that we’ll likely end up spending more money on monitoring than we currently are. IMO what’s missing is the scope of what we want to do, a roadmap, and whether or not there’s a budget for it. We should probably discuss this before bringing it into production curation.

Finally, re Bugsnag: IMO they are two separate things with different purposes, although there is some overlap.

Given the recent changes in the team, and thus the scarce dev resources we have, I propose we postpone this and pay for the two extra (CA and US) servers in Datadog. This will have us covered during Xmas sales and provide us with the insights we need to enable data replication for CA. After Xmas, we could start incepting this replacement with ElasticStack. If you all agree, we can create and prioritize the issue to enable Datadog for these two ASAP. Thoughts?