Proposal: Monitoring with ElasticStack

Hello! I’d like to propose switching our monitoring from Datadog (SaaS) to a self-hosted ElasticStack setup.

Datadog vs ElasticStack

We currently have limited monitoring on the “big three” instances (aka the Triumvirate) and no monitoring on the others, and this costs us around $180/month. These costs would increase if we added monitoring on additional servers (like US and Canada) or additional monitoring features, like improved log ingestion. The price-to-value ratio is pretty poor, and it gets worse at scale.

If we switch to a self-hosted ElasticStack setup we can have world-class monitoring on all servers, and the cost will be the price of a decent server (maybe $40/month). The costs would not increase above $40/month if we added any number of additional servers, or any additional monitoring features. I think we could easily get the equivalent monitoring value of a $2000/month Datadog bill out of that $40/month.
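
To make the comparison concrete, here is a rough sketch of the arithmetic. It assumes Datadog pricing scales linearly per host (the $180/month for three instances works out to about $60 each, which is my simplification, not a quoted price) and that the ElasticStack server stays at a flat $40/month. The function names are made up for illustration, and maintenance time is not counted.

```python
def datadog_monthly_cost(hosts: int, per_host: float = 60.0) -> float:
    """Assumption: linear per-host pricing, $180/month over 3 hosts = $60 each."""
    return hosts * per_host


def elastic_monthly_cost(hosts: int, server_cost: float = 40.0) -> float:
    """Assumption: one self-hosted server at a flat price, whatever the host count."""
    return server_cost


if __name__ == "__main__":
    # Current three instances, then with US and Canada added, then a larger fleet.
    for hosts in (3, 5, 10):
        print(hosts, datadog_monthly_cost(hosts), elastic_monthly_cost(hosts))
```

Under these assumptions, adding the US and Canada servers takes the Datadog figure from $180 to $300 per month, while the self-hosted figure stays at $40.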

Metabase

With the Metabase work moving forward there is a growing need for custom monitoring to be added to the Metabase server. If we do it with Datadog it will be expensive…

Alerts

I think Aus production currently has some custom alerting on metrics like disk space (via Wormly), which would be great to have on all servers. With ElasticStack we can set up any kind of alerts we want, like server uptime or alerts when the disk is dangerously full, on all servers. As OFN continues to grow I think we will increasingly need this kind of setup for managing an expanding infrastructure with a small team.
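
As an illustration of what a “disk is dangerously full” alert boils down to, here is a minimal standalone sketch. It is not ElasticStack configuration and not what Wormly does; the 90% threshold and the function names are assumptions chosen for the example.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    return 100.0 * used / total


def disk_is_dangerously_full(path: str = "/", threshold: float = 90.0) -> bool:
    """True when the filesystem holding `path` is at or above the threshold.

    The 90% default is an assumed value, not a recommendation from this thread.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    if disk_is_dangerously_full("/"):
        print("ALERT: disk usage is above the threshold")
```

A real setup would run this kind of check centrally against metrics shipped from every server, rather than as a script on each box.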

Cost of setup

Setting up a nice Ansible repo for automatically provisioning and configuring the ElasticStack server and configuring the new monitoring agents takes quite a bit of work… which I’ve already done. I pushed the new (private) repo this morning for devs to look at and can do a live demo at some point. It’s been sat on my hard drive for a while because it required Rails 4, but now that we’re there I’ve retested it and it’s basically production-ready.

While investigating UK downtime just now, @Matt-Yorkley and I were talking about the lack of security monitoring within OFN.

I’d love to hear the security case for switching to ElasticStack :pray:

Yep, as a secondary benefit it would also enable various options for security monitoring across all our servers, which would be really nice.

I fully support this switch.
I look forward to having app logs in there as well! With that, and some work, I think this can replace Bugsnag, or at least be integrated with Bugsnag alerts so we have a single source of monitoring data and alerts.

I think @sauloperez’s concern was: which Datadog features are we going to miss?

Are there any other concerns about this move?

There are a few questions floating around at the moment, like:

  • Can we continue to roll out our data/analytics setup? What impact would it have on smaller instances?
  • Can we use the new security setup on all servers? What impact would it have?
  • How urgently do we need to upgrade the US and Canadian servers? Are they okay or are they on fire?

The current answer to all of these questions is: “we have no idea, and no way of finding out”.

If it’s less expensive and more powerful than Datadog, no instance manager will disapprove of this switch.

That means it is up to core devs to make a decision here, as it means maintaining a new server? So I guess this needs input from @maikel @apb @sauloperez?

It’s cheaper in hosting costs but probably more effort in maintenance. Using open source is more aligned with our values, though.

I’m not sure we can outright replace Bugsnag, but we can definitely replace Datadog, HappyApps, and Wormly for all servers.

There’s no doubt that we need to expand our monitoring to other instances, and at scale a SaaS such as Datadog is going to cost quite a lot of money. However, we tend to greatly underestimate the time it takes to maintain a self-hosted solution. That’s another server that needs maintenance, upgrading, etc., but also a new tool that needs mastering, and in my experience nothing related to observability is easy.

That being said, I do see it as an investment that OFN needs, knowing that we’ll likely end up spending more money on monitoring than we currently are. IMO what’s missing is the scope of what we want to do, a roadmap, and whether or not there’s a budget for it. We should probably discuss this before bringing it into production curation.

Finally, re Bugsnag: IMO they are two separate things with different purposes, although there is some overlap.

Given the recent changes in the team, and thus the scarce dev resources we have, I propose we postpone this and pay for the two extra (CA and US) servers in Datadog. This will have us covered during Xmas sales and give us the insights we need to enable data replication for CA. After Xmas, we could start incepting this replacement with ElasticStack. If you all agree, we can create and prioritize the issue to enable Datadog for these two ASAP. Thoughts?

Hi everyone, just following up on this old thread to see if we could restart the process for getting the OFN Canada server set up with a data analytics tool. Datadog, ElasticStack, and Metabase are all discussed as possibilities upthread. We are agnostic about the choice of tool, but we are reaching the point where we need more data analytics capacity than we currently have in order to run the instance effectively. There are a couple of key issues that have caused us to return to this request at this time:

  1. We have had reports from one of our larger hubs that regular customers of theirs have placed orders that have not been received by the hub manager. At least two of the customers were long-standing customers who had placed many orders with this hub in the past, making it unlikely that it was human error. To diagnose the problem properly, we have concluded that we need the ability to search for datapoints that we have been unable to access using super-admin.

  2. We are hiring a digital marketing consultant who is going to help set up some automated Mailchimp user journey marketing. As the first stage of this process, they will need access to more fine-grained user-related data than we are currently able to gather through Matomo, which is not set up to track sales / ecommerce data. We are hoping to give the consultant access to enough data that they can identify 3 key user personas and develop Mailchimp-based customer journeys that are tailored to each one. In order to do this effectively, the consultant will need access to more data than we can currently provide them via Matomo and super-admin.

Reviewing the thread here, and some other old threads on Slack and Discourse, it seems like the key blocker to getting this work done previously was the concern that the OFN Canada server didn’t have enough surplus capacity to handle the extra load, and that the performance of the server was not being actively monitored. It is my understanding that the situation has changed somewhat since these discussions were last active, as our server was upgraded a year or so ago. In addition, we do have some funds to pay a dev (maybe George?) to do this work, if needed.

This comment might be better placed somewhere else (perhaps I should create a new issue on GitHub – I checked, and all the previous related issues appear to have been closed)? If so, just chime in and let me know where to repost and I’ll get on it – cheers!