Making operations a first-class citizen

I think it is the right time to start talking about operations. This stems from two different situations we have experienced recently:

1. Caching issues in v2

Working on [Spree Upgrade] Shop showing incorrect products (product cache in v2) · Issue #3391 · openfoodfoundation/openfoodnetwork, I realized we have little idea of how the products cache works and, more importantly, how it performs in the different environments, including development.

2. Downtime in France

Along the same lines, earlier this week Open Food France experienced 1h 40min of
downtime for reasons we could guess at but have no data to confirm. There is
a detailed postmortem report about the incident in Postmortem report on Open Food France's incident on 03/20/2019 02:47 PM UTC.

TL;DR: the suspects are deadlocks and background jobs and their use of the DB.

Suggestions

I essentially think we need to start considering operations as part of the
responsibilities of the core team. We can discuss whether that means all devs, some of us,
etc., but we need to look after our instances’ operation in production if we
want to provide a service to our communities. I’m personally eager to help as
well as to grow in this area.

What follows is a list of action items I came up with which I think would make a difference.

Change to a paid Datadog plan

Move from the free Datadog plan so that we get data retention longer than a day
and can use their instrumentation service. This will allow us to investigate
incidents days after they’ve happened (addresses point 2) and to monitor details about the behavior of the app (addresses point 1).

I’m particularly interested in Datadog’s Delayed Job integration, which I hope could help a great deal when debugging situations like the one raised by Kirsten for AU earlier this week. Background jobs monitoring · Issue #318 · openfoodfoundation/ofn-install already targets this.
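
To give an idea of what that would involve, here is a minimal sketch of a Rails initializer, assuming the ddtrace gem’s Rails and Delayed Job instrumentation; the service names are made up for illustration and nothing like this exists in our codebase yet.

    # config/initializers/datadog.rb — hypothetical sketch, service names invented
    require 'ddtrace'

    Datadog.configure do |c|
      # Trace the Rails app itself so requests show up in Datadog APM
      c.use :rails, service_name: 'openfoodnetwork'
      # Trace Delayed Job workers so job timings and failures show up too
      c.use :delayed_job, service_name: 'openfoodnetwork-jobs'
    end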

Enable PostgreSQL detailed logging

While firefighting the incident in OFF described above, I found myself totally blind to the operations the DB server was performing, and thus I couldn’t pinpoint the root cause of the problem. I already knew it, but this time we must (see the sketch after this list):

  • Enable the pg_stat_statements extension
  • Make use of the pg_stat_activity view (built into PostgreSQL, not an extension)
  • Enable regular logging
  • Enable Datadog’s PostgreSQL integration
  • Configure deadlock logging and handling
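
As a rough sketch of what that could look like in postgresql.conf (the values here are placeholders to illustrate, not thresholds we have agreed on):

    # postgresql.conf — illustrative values only
    shared_preload_libraries = 'pg_stat_statements'  # must be preloaded before CREATE EXTENSION
    pg_stat_statements.track = all

    log_min_duration_statement = 500   # log any statement slower than 500 ms
    log_lock_waits = on                # log when a session waits on a lock longer than deadlock_timeout
    deadlock_timeout = 1s
    log_line_prefix = '%t [%p]: user=%u,db=%d '

After a restart we would still need to run CREATE EXTENSION pg_stat_statements; in the OFN database (pg_stat_activity needs no extension, it is built in), and point the Datadog PostgreSQL integration at the same server.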

Establish a baseline for the v2 roll-out

In v2 we have touched the very foundations of OFN and started taking ownership of many parts of the app that most of the current devs didn’t know before. My point is that, like with any other major release, things can go wrong at first, especially with things we didn’t implement ourselves.

To actually know when things are going south we need something to compare to. We need a set of metrics that will act as a baseline: things like the number of background jobs executed per unit of time or the number of reads and writes to the products cache, just to name a few.

You get the idea. Once one of those values drops or sky-rockets, watch out: we might have introduced a regression. Without this baseline, issues will go unnoticed until it’s probably too late. Obviously, this needs to be done before the v2 roll-out IMO.
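
For the custom part of that baseline, something along these lines could do the job. It is only a hypothetical sketch assuming the dogstatsd-ruby gem; the metric names are invented for illustration.

    # Hypothetical sketch using the dogstatsd-ruby gem; metric names are made up
    require 'datadog/statsd'

    statsd = Datadog::Statsd.new('localhost', 8125)

    # Count every read of the products cache so we can graph a baseline rate
    statsd.increment('ofn.products_cache.read')

    # Report the pending Delayed Job queue size, e.g. from a periodic task
    statsd.gauge('ofn.delayed_job.pending', Delayed::Job.where(failed_at: nil).count)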

Conclusion

Note that what I’m proposing is not yet another full-featured initiative we should focus on next, but a series of small actions that can put us in a much better position. I see this as a two-week effort.

I think our well-being deserves this and much more. What do you think? Let’s discuss.


Sounds good, Pau.

Datadog and pg stats: :+1:

Re the baseline metrics, how exactly do you want to collect them in the short term, before v2?

Re the baseline, I would also use Datadog. The Delayed Job, Postgres and Rails integrations they provide will give us counts and gauges to compare against. Also, we can send custom metrics of our own.

This all goes back to the point of “Change to a paid Datadog plan”. I’m up for Katuma paying for it, as we did with Bitwarden, as another contribution to the pot. Let’s see if others in the team agree.

Thank you for bringing this up, Pau.

Change to a paid Datadog plan

Yes! That would be very useful. We have worked with Wormly, which offers us that kind of data retention, and it has proven its worth. Using Datadog’s paid plan would be even better and available to the whole dev community.

Enable PostgreSQL detailed logging

Sounds good but I can’t judge it. We should investigate what the possible performance impact is before we enable all that logging.

things are going south

Hey, are you saying going to Australia is a bad thing? :sunny:

Establish a baseline for the v2 roll-out

Good idea. I guess we can watch these metrics with Datadog, right?

We should investigate what the possible performance impact is before we enable all that logging.

I will need to dig into the Postgres docs to get some insights, but from my experience with similar MySQL utilities the difference is negligible, although we had better and larger hardware back then. In any case, there is hardly any way around it IMO. Obviously, we must keep an eye on it.

Hey, are you saying going to Australia is a bad thing? :sunny:

You know how much I love it :heart:

I guess we can watch these metrics with Datadog, right?

That’s my idea: watch those graphs closely, and then perhaps investigate setting up alerts based on their values.
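
For reference, a Datadog metric monitor boils down to a query over one of those metrics. A placeholder example, reusing the hypothetical ofn.delayed_job.pending metric from the sketch above (the threshold is arbitrary):

    avg(last_15m):avg:ofn.delayed_job.pending{env:production} > 100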

For the record, we’ll need to update https://github.com/openfoodfoundation/ofn-install/wiki/Server-monitoring as we move forward.