Monitoring our APIs

filipefurtado · September 10, 2023, 7:38pm

What’s the problem

We currently don’t have an automated process to track API changes: if the code changes and the corresponding specs are updated along with it (i.e., the build stays green), we may end up shipping changes that break integrations.

Some background

We’ve had some unnoticed changes on endpoints which were not announced to instance managers. This was discussed at the latest delivery-circle meeting, which followed up on recent v0-integration outages, which happened here and here.

First measure

The first agreed change was to better flag API changes within the release process. This was done here, and introduces a dedicated section when creating PRs, so that we can automatically generate release notes and signal these changes in good time.

How to automate?

We’ve briefly discussed how to automate and monitor these changes, for the unsupported v0 API but also for the v1 and DFC APIs.

Two main ideas came about:

i) @maikel proposed a script which screens for spec changes when drafting releases. This way, even if a developer forgets to flag a PR as API-changing, the script could catch it and alert the release manager, who would in turn alert instance managers. This would happen before the release is deployed to production.

ii) @lin_d_hop proposed to use Postman or some other tool to assess which endpoints are used the most, and have a process for actively monitoring endpoints. This could happen either at staging (before), or at production servers (after deployment into production).
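For proposal i), a minimal sketch of what such a screening script could look like, written in Python for illustration. It assumes that API specs can be recognised by their path: `spec/controllers/api/` is where the v0 specs live, while the other prefixes and both function names are hypothetical and would need adjusting to the real repository layout.

```python
import subprocess

# Path prefixes treated as "API specs". Only spec/controllers/api/ is taken
# from the codebase; the other two are assumptions for illustration.
API_SPEC_PREFIXES = (
    "spec/controllers/api/",
    "spec/requests/api/",
    "swagger/",
)


def api_spec_changes(changed_files, prefixes=API_SPEC_PREFIXES):
    """Return the changed paths that look like API spec files, sorted."""
    return sorted(path for path in changed_files if path.startswith(prefixes))


def changed_files_between(previous_tag, new_tag):
    """List the files that differ between two release tags, using git."""
    result = subprocess.run(
        ["git", "diff", "--name-only", previous_tag, new_tag],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

While drafting a release, the release manager could call `api_spec_changes(changed_files_between(previous_tag, new_tag))` with the two release tags. A non-empty result only means "please review these files"; it cannot tell a functional change from a refactored spec.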

On proposal ii)
I’ve explored using Postman to monitor a given endpoint, and found that:

it’s fairly easy to set up monitors, which make real requests and check the payloads with a JavaScript test. As a proof of concept, one can see the endpoint monitor go from green to red when different releases are staged on a staging server.
monitors can be set up in collections which can (as far as I understood) be run as part of a GitHub Action (by using Newman, a CLI tool to run Postman collections) - I have not tried this
we’re currently in the process of replacing Datadog with New Relic. It seems possible to integrate Postman API monitoring with New Relic. It could be a nice-to-have, to see all monitoring in one place. I have not tried this either.
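To illustrate the kind of payload check such a monitor performs, here is a rough sketch in Python (the actual Postman tests are written in JavaScript, so this is not the monitor code itself). The function name and the field names in the usage note are hypothetical, and the sketch assumes the endpoint returns either a JSON object or a list of JSON objects.

```python
import json


def missing_fields(payload_text, required_fields):
    """Return the required fields that are absent from at least one item.

    Assumes the payload is a JSON object or a list of JSON objects.
    """
    data = json.loads(payload_text)
    items = data if isinstance(data, list) else [data]
    missing = set()
    for item in items:
        # Collect every required field this item does not provide.
        missing.update(field for field in required_fields if field not in item)
    return sorted(missing)
```

A monitor would stay green while `missing_fields(response_body, ["id", "name", "price"])` returns an empty list, and turn red as soon as a release drops or renames one of those fields.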

Summary and Open questions

Proposal i) sounds like a quick win, which might not take too much time to implement.
Proposal ii) might be something to aim for, as a process to back up all API work and ensure integrations keep working. There might be several ways to achieve this. The Postman-Newman-New Relic process may have some pitfalls though:

cost? I think there might be a limit to the number of tests/endpoints we can use on a free basis
tests are in JavaScript - I could not find an easy way to use RSpec while monitoring with Postman
how seamless is this really? I’m wondering what others think on this approach

Other more general questions

monitoring production or staging? Monitoring production has obvious wins, but may have some downsides as well:
- production monitoring: we see endpoints breaking only after shipping, which may be too late; may impact performance?
- staging monitoring: the disadvantages above don’t apply, but tests in staging are always less valuable than production tests, as we may fail to take into account production configurations (timezone, traffic, etc)
is there a tool/process to know which integrations use which endpoints the most?
what other process/tools could be used instead, to actively monitor our APIs?
how should we prioritize this exploration?

Plenty of questions. Thoughts?

maikel · September 11, 2023, 1:22am

As developers, we don’t know all the integrations that are out there. So we can’t set up monitoring for them. I think that we should encourage the API consumers to monitor with tools like Postman and set up their own alerts. By the time a monitor detects issues in production, it’s a bit late though. Pointing monitors at staging won’t work either, because it would generate lots of false alarms when we test buggy pull requests.

We could check the Postman monitoring during release tests though. Then we don’t need alerts and can check that all monitors are still green. We need a way for API consumers to contribute to those monitors. Is there a public monitor page to check if existing monitors are still passing? Then each API consumer could manage their own monitors and release testers just have to look at those pages to check.

filipefurtado · September 11, 2023, 1:27pm

Yes, agreed @maikel. Indeed, this is part of the challenge, also for testing. Maybe we can’t know about all integrations, but perhaps we can know which endpoints are used.

I’ve found this gem: renuo/rails_api_logger (GitHub), an inbound and outbound requests logger for your Rails application

It should log requests in two tables, outgoing/incoming with these attributes:

Seems to be maintained regularly. Maybe good to give it a go, and verify if it is usable to assess which endpoints are most used?
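If the logged requests can be exported, ranking endpoints by usage is a small counting exercise. A sketch under stated assumptions: each logged request is available as a dict with `method` and `path` keys (hypothetical names, not checked against the gem’s schema), and numeric path segments are record ids that should be grouped together.

```python
import re
from collections import Counter


def normalise(path):
    """Drop the query string and replace numeric path segments with ':id'."""
    path = path.split("?", 1)[0]
    return re.sub(r"/\d+(?=/|$)", "/:id", path)


def most_used_endpoints(rows, top=10):
    """Count requests per (method, normalised path), most frequent first."""
    counts = Counter((row["method"], normalise(row["path"])) for row in rows)
    return counts.most_common(top)
```

Grouping `/api/v0/products/12` and `/api/v0/products/7` under one entry keeps the ranking about endpoints rather than individual records.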

dcook · September 12, 2023, 12:03am

Thanks for looking into this Filipe.

From my point of view, these proposals are an extension of our “specification”. We currently aim to specify all behaviour of the app with tests. In the case of v0/products, I think we currently cover this with spec/controllers/api/v0/products_controller_spec.rb.

It’s also worth pointing out that we don’t officially support v0, with this being made clear in the API handbook:

If you choose to use the OFN v0 API you do so at your own risk.

So initially I don’t agree to add more overhead to v0.
I do, however, recognise that these endpoints are being used, so I think we should discuss promoting them to v1, and prioritise what tasks are needed in order to do that. Is there a roadmap/epic for this?

So my question is, would we make the same mistake for v1, and should we add further protection for supported v1 endpoints? Regarding the proposals in light of this:

This checking should already happen as part of our development process: when a developer makes a change, that includes modifying the specification. But the developer is trying to co-ordinate various requirements and may lose sight of the effect of these changes, so…

We add a new layer that runs at a different time, which looks simply for changes to the API? When would this run, with CI? Or as Maikel suggests, we have a way of checking it on a staging environment when testing a release. Sounds good, but extra work. I like the idea of users being able to provide their own specification.

But it seems to me that it would be better to work more on a solid specification managed within the OpenAPI format. Changes to this can be clearly communicated on a Thursday via #instance-managers.
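With the specification in OpenAPI format, the most disruptive kind of change (an operation disappearing) can be detected by comparing two versions of the document. A minimal sketch, assuming both versions have already been parsed into Python dicts (for example from JSON); the function names are hypothetical, and it only looks at removed path/method pairs, not at changed parameters or response schemas.

```python
# HTTP methods that can appear as operation keys under an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operations(spec):
    """Return the set of (METHOD, path) pairs declared in an OpenAPI document."""
    return {
        (method.upper(), path)
        for path, item in spec.get("paths", {}).items()
        for method in item
        if method in HTTP_METHODS  # skip keys such as "parameters" or "summary"
    }


def removed_operations(old_spec, new_spec):
    """List operations present in the old document but not in the new one."""
    return sorted(operations(old_spec) - operations(new_spec))
```

Anything returned by `removed_operations` is a breaking change by definition, so it would be a natural candidate for the Thursday announcement.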

sigmundpetersen · September 12, 2023, 6:42am

dcook · September 12, 2023, 7:50am

Thanks, I took a look through the links and it looks like it was decided not to progress the products endpoint to v1. I’m guessing we decided to focus on the DFC products endpoint instead.

So perhaps that’s why we’re in this in-between state and trying to find other ways to improve the situation…

maikel · September 13, 2023, 4:21am

Moving endpoints to v1 is another breaking change that doesn’t solve anything. We would want to write proper specs for the endpoints when we do that and that’s what could be useful. We can do that for v0 as well though and close the gap without promising to keep it stable. We actually want to phase out v0 and didn’t want to pour work into it, not even specs.

In the last example of breaking API changes, there was actually a spec change. Nobody picked it up though. So the new visibility we have is maybe everything we should do for now.

dcook · September 14, 2023, 4:41am

I had a go at reviewing changed API spec files while drafting the release. It was a big one, so a good test. The git commands I used, and the results, are here:

github.com/openfoodfoundation/openfoodnetwork

Release v4.4.11

opened 03:49AM - 14 Sep 23 UTC

dacook

## 1. Preparation on Thursday

- [x] Merge pull requests in the [Ready To Go] …column
- [x] Include translations
  <details><summary>Command line instructions:</summary>
  <pre>
  <code>
  git checkout master
  git pull upstream master
  tx pull --force
  git commit -a -m "Update all locales with the latest Transifex translations"
  git push upstream master
  </code>
  </pre>
  </details>
- [x] Create a tag: `git push upstream HEAD:refs/tags/vX.Y.Z`
- [x] [Draft new release]. Look at previous [releases] for inspiration.
  - Select new release tag
  - _Generate release notes_ and arrange into categories as required.
- [x] Notify [#instance-managers] of user-facing changes.

## 2. Testing

- [x] Move this issue to Test Ready.
- [x] Notify `@testers` in [#testing].
- [ ] Test build: [Deploy to Staging] with release tag.
  - Please also check on the cart page that the "Update" button appears as expected, using French (which has longer label "mettre a jour")

## 3. Finish on Tuesday

- [ ] Publish and notify [#global-community] (this is automatically posted with a plugin)
- [ ] Deploy the new release to all managed instances.
  <details><summary>Command line instructions</summary>
  <pre>
  cd ofn-install
  git pull
  ansible-playbook --limit all-prod --extra-vars "git_version=vX.Y.Z" playbooks/deploy.yml
  </pre>
  </details>
- [ ] Notify [#instance-managers]:
  > @instance_managers The new release has been deployed.
- [ ] Nudge next release manager

The full process is described at https://github.com/openfoodfoundation/openfoodnetwork/wiki/Releasing.

[Ready To Go]: #zenhub
[Transifex pull request]: https://github.com/openfoodfoundation/openfoodnetwork/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+head%3Atransifex
[Draft new release]: https://github.com/openfoodfoundation/openfoodnetwork/releases/new?tag=v&title=v+Code+Name&body=Congrats%0A%0ADescription%0A%0A%23%23+User+facing+changes+:eyes:%0A%0A%0A%23%23%23+Experimental+features+for+testing+:sunglasses:%0A%0A%0A%23%23+Technical+changes+:wrench:%0A%0A
[releases]: https://github.com/openfoodfoundation/openfoodnetwork/releases
[#instance-managers]: https://app.slack.com/client/T02G54U79/CG7NJ966B
[#testing]: https://openfoodnetwork.slack.com/app_redirect?channel=C02TZ6X00
[Deploy to Staging]: https://github.com/openfoodfoundation/openfoodnetwork/actions/workflows/stage.yml
[#global-community]: https://app.slack.com/client/T02G54U79/C59ADD8F2

What I found was that there were several spec files changed, but only some were functional changes. These were correctly flagged in the respective PRs. It adds an extra manual step of reviewing these changes, but I guess that’s the idea.