Ideas for performance as an acceptance criterion

Performance has come up a couple of times in recent testing meetings: how do we keep an eye on the performance of the app? What follows are the notes and proposals brought up in those two meetings. These are just ideas for now; nothing has been put into practice yet, but we will at some point.

Random thoughts that fuel the conversation

  • There are no defined requirements. The app should be “fast” and “scalable”, but no one agrees on what those words mean.
  • Performance metrics and alerts are a know-how silo. Does anyone actually look at Datadog? As things stand, performance work has no way into the delivery pipeline.
  • Because there is no agreed definition of failure, any perceived performance regression can always be punted into the future.
  • We probably don’t even know whether the app is slow at all.
  • Moving performance into a pass/fail workflow requires extra work up front to define requirements, but provides smooth sailing afterward: once a threshold is agreed, any regression becomes a bug.

Red/green setup

Firm up your requirements, and pass/fail them automatically.

See Nate Berkopec’s “The Red-Green Performance Approach”.

  • We agree on performance metrics for the entire app. Four or five should do, but we might need to settle on one or two first.
  • Once the metric-driven requirements are defined, we turn them into monitoring alerts, watched by a computer rather than a human.
  • When an alert fires, we treat it as a bug and put it into the pipeline.
  • With high-level numbers (more about this below), we can “back into” lower-level metrics as well, such as server response times.

See: Monitor Core Web Vitals With Datadog RUM and Synthetic Monitoring | Datadog

Overarching metrics

Core Web Vitals

We should adopt industry-standard, holistic metrics first and work top-down to more specific ones. That is what Google’s Core Web Vitals and Apdex are for. See https://web.dev/vitals/#core-web-vitals and Apdex.org.

I propose we start by focusing only on Largest Contentful Paint (LCP), the Core Web Vitals metric that measures loading performance. To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.

Then, as we reach that standard number and rewrite more parts of the app with the new FE stack, we can focus on the other two Core Web Vitals metrics.

Server response time is only one component of LCP. To leave room for a decent LCP, the average response time should be approx. 175ms. We can track this with Apdex.

Apdex

Apdex measures how well the application performs from a business point of view. It is a numerical measure of user satisfaction with the performance of enterprise applications: many measurements are converted into one number on a uniform 0-to-1 scale (0 = no users satisfied, 1 = all users satisfied). Given a target time T, samples at or under T count as satisfied, samples between T and 4T as tolerating, and the rest as frustrated; the score is (satisfied + tolerating / 2) / total samples.

We aim for a T of 4s, which maps to an LCP of 5.7s. That is awful, but it is a baseline to start with. We then rinse and repeat until we get to a T of 700ms (the industry standard). Meanwhile, we should monitor how lowering T lowers LCP too (which means more anonymous user tracking). We mustn’t forget that Apdex applies to the API too: consumers will pay the LCP cost in their own apps.
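As a minimal sketch of how the score behaves, the standard Apdex formula can be computed like this (the sample response times below are made up):

```javascript
// Apdex over a set of response-time samples, per the standard definition:
// satisfied (<= T) count fully, tolerating (<= 4T) count half, the rest not at all.
function apdex(samplesMs, tMs) {
  const satisfied = samplesMs.filter((ms) => ms <= tMs).length;
  const tolerating = samplesMs.filter((ms) => ms > tMs && ms <= 4 * tMs).length;
  return (satisfied + tolerating / 2) / samplesMs.length;
}

// With the proposed starting T of 4000 ms (sample values are made up):
const samples = [500, 1200, 3900, 4100, 9000, 17000];
console.log(apdex(samples, 4000)); // 3 satisfied, 2 tolerating, 1 frustrated → ≈ 0.67
```

Note how tolerating samples only count half: lowering T pushes samples from satisfied into tolerating (and then frustrated), which is why ratcheting T down forces real response-time work.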

Proposal

I propose we make the T above a requirement. That is two-fold: it becomes an acceptance criterion for all new epics, and it also needs to be watched to catch regressions. Then, as we lower that T, the acceptance criterion for future epics tightens with it.

How to track this

Before merge, we could follow a green/red approach with something like Lighthouse CI (to be explored). We might start with LCP only and gradually enable more assertions as metrics improve.
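As a sketch of what that could look like, Lighthouse CI’s assertion config lets us fail the build on an LCP budget. The URL, run count, and budget below are placeholders we would need to tune:

```javascript
// lighthouserc.js — a sketch, not a final config.
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/'], // pages to audit (placeholder)
      numberOfRuns: 3, // median out run-to-run noise
    },
    assert: {
      assertions: {
        // Start with LCP only; add more Core Web Vitals assertions later.
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
      },
    },
  },
};
```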

A CI build doesn’t account for production-env variables: data, configuration, load, concurrency, etc. That’s why, after merge, we should monitor Apdex from Datadog (plus Datadog’s RUM?). We should then define alerts based on Apdex drops and create bug reports out of them. Disclaimer: we shouldn’t flood the pipeline; we need to find a balance.
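One way to express such an alert is a Datadog monitor definition. The shape below follows Datadog’s Monitors API, but the metric name, window, and threshold are placeholders; we would need to check what metric our APM setup actually emits:

```json
{
  "name": "Apdex dropped below target",
  "type": "query alert",
  "query": "avg(last_10m):avg:your.apdex.metric{env:production} < 0.7",
  "message": "Apdex fell below 0.7. Treat it as a bug and put it into the pipeline.",
  "options": { "thresholds": { "critical": 0.7 } }
}
```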

TL;DR:

  • Investigate Lighthouse CI to track performance regressions (lab conditions)
  • Set up the proposed thresholds in Datadog, experimentally (field conditions)

Prioritization

TBD. This will require digging deeper to see which endpoints drag Apdex down, i.e. Datadog’s endpoint list sorted by total time, top to bottom. We can then choose epics to work on.

How do we handle this in terms of priority? Do we identify these epics and prioritize them on a regular basis, so that we are always working on performance, but well scoped?

Then, who watches Datadog’s alerts? Is it worth having someone other than devs look into this? Product? Testing?
