I think it is the right time to start talking about operations. This stems from two different situations we have experienced recently:
1. Caching issues in v2
Working on [Spree Upgrade] Shop showing incorrect products (product cache in v2) · Issue #3391 · openfoodfoundation/openfoodnetwork, I realized we have little idea of how the products cache works and, more importantly, how it performs in the different environments, including development.
2. Downtime in France
Along the same lines, earlier this week Open Food France experienced 1h 40min of downtime for reasons we could explain but had no data to confirm. There is a detailed postmortem report about the incident in Postmortem report on Open Food France's incident on 03/20/2019 02:47 PM UTC.
TL;DR: the suspects are deadlocks and background jobs, and their use of the DB.
Suggestions
I essentially think we need to start considering operations as part of the core team’s responsibilities. We can discuss whether that means all devs, only some of us, etc., but we need to look after the operation of our instances in production if we want to provide a service to our communities. I’m personally eager to help, as well as to grow in this area.
What follows is a list of action items I came up with which I think would make a difference.
Change to a paid Datadog plan
Change from the free Datadog plan so that we both get data retention longer than a day and can use their instrumentation service. This will allow us to investigate incidents days after they’ve happened (addresses 2), plus monitor details about the behavior of the app (addresses 1).
I’m particularly interested in Datadog’s Delayed Job integration, which I hope could help a great deal when debugging situations like the one raised by Kirsten for AU earlier this week. Background jobs monitoring · Issue #318 · openfoodfoundation/ofn-install already targets this.
Enable detailed PostgreSQL logging
While firefighting the OFF incident described above, I found myself totally blind to the operations the DB server was performing and thus couldn’t pinpoint the root cause of the problem. I already knew this, but this time we must (see the sketch after the list):
- Enable the pg_stat_statements extension
- Query the pg_stat_activity view during incidents (it’s built in, we just need to start using it)
- Enable regular logging
- Enable Datadog’s PostgreSQL integration
- Configure deadlock logging and handling
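Here is a minimal sketch of what that could look like on the DB server, run as a superuser. The thresholds, the datadog user name and its grants are assumptions on my side and would need to be checked against our instances and Datadog’s docs:

```sql
-- pg_stat_statements has to be preloaded, then created in the OFN database.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- (requires a PostgreSQL restart before the next statement works)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Regular logging: log every statement slower than 500 ms (placeholder value).
ALTER SYSTEM SET log_min_duration_statement = '500ms';

-- Deadlock and lock-wait visibility.
ALTER SYSTEM SET log_lock_waits = 'on';
ALTER SYSTEM SET deadlock_timeout = '1s';
SELECT pg_reload_conf();

-- pg_stat_activity is a built-in view, nothing to enable; during an incident
-- we can query it directly, e.g. to list the longest-running statements:
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;

-- Read-only user for Datadog's PostgreSQL integration (name and grant are
-- assumptions; the exact setup is in Datadog's integration docs).
CREATE USER datadog WITH PASSWORD 'changeme';
GRANT SELECT ON pg_stat_database TO datadog;
```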
Establish a baseline for the v2 roll-out
In v2 we have touched the very foundations of OFN and started taking ownership of many parts of the app that most of the current devs didn’t know before. My point is that, as with any other major release, things can go wrong at first, especially with things we didn’t implement ourselves.
To actually know when things are going south we need something to compare against: a set of metrics that act as a baseline. Things like the number of background jobs executed per unit of time or the number of reads and writes to the products cache, just to name a few.
You get the idea: once one of those values drops or sky-rockets, watch out, we might have introduced a regression. Without this baseline, issues will go unnoticed until it’s probably too late. Obviously, this needs to be done before the v2 roll-out, IMO.
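To illustrate the background jobs part of that baseline, here is a sketch of a snapshot query over Delayed Job’s table, assuming the standard delayed_jobs schema and PostgreSQL 9.4+ (for FILTER); the grouping and columns are just examples:

```sql
-- Snapshot of the Delayed Job queue, meant to be collected regularly
-- (e.g. every few minutes) so we can build the baseline over time.
SELECT queue,
       count(*)                                      AS total,
       count(*) FILTER (WHERE locked_at IS NOT NULL) AS running,
       count(*) FILTER (WHERE failed_at IS NOT NULL) AS failed,
       count(*) FILTER (WHERE run_at <= now()
                          AND locked_at IS NULL
                          AND failed_at IS NULL)     AS ready_to_run
FROM delayed_jobs
GROUP BY queue;
```

Since Delayed Job removes successful jobs by default, a snapshot like this only becomes a baseline if we collect it periodically, which is exactly where the Datadog integration mentioned above would help.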
Conclusion
Note that what I’m proposing is not yet another full-featured initiative we should focus on next, but a series of small actions that can put us in a much better position. I see this as a two-week effort.
I think our well-being deserves this and much more. What do you think? Let’s discuss.