Introduction
When there are a lot of products to display, the shopfront loads very slowly (~30 s for Stroudco with 386 products). Customers tend to leave when they get frustrated waiting for websites to load, so we’d really like to make it faster.
The products data changes quite infrequently, and stale data (on the order of a minute or so) is acceptable, since customers already experience this when they leave their browser window open for that period of time after a page load. Both of these characteristics make caching a promising approach.
Below are some software design notes on one approach to achieving this.
Refreshing the cache & stale data
The cache will be refreshed asynchronously in a delayed job.
The front end always serves from the cache.
Sometimes this will mean serving stale data, but this is not terrible -
it’s the same effect as if the customer had loaded the page before the change went through.
As long as the data is refreshed in roughly the same amount of time as they might stay on the page, we’re fine.
Experience adding out of stock items to cart
It would be helpful to make the “adding an out of stock product to cart” experience as nice as possible. Here’s a breakdown of what happens now and what the desired behaviour would be.
Add to cart
- Now: Request fails, and we retry every 3 seconds
- Desired: Alert the user, suggest a reload, and perform it asynchronously
Item already in cart, visit cart
- Now: Shows as out of stock
- Desired: No change required
Item already in cart, visit checkout
- Now: Returns to shop (CheckoutController#raise_insufficient_quantity)
- Desired: Return to cart
Item already in cart, perform checkout
- Now: Silent fail on “Place order now”. CheckoutController#raise_insufficient_quantity redirects to /shop, but the PUT there gives a 404 (no update action on ShopController)
- Desired: Redirect to cart; the JS should see this redirect and follow it as a full page reload
Development/test environments
In these environments, the products JSON is generated synchronously on each page load, without caching.
Cache keying
products-json-#{shop.id}-#{order_cycle.id}
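As a sketch, the key could be built by a small helper, with one cached blob per (shop, order cycle) distribution. The method name products_cache_key below is an assumption for illustration, not the real code:

```ruby
# Hypothetical helper: one cache key per (shop, order cycle) distribution.
def products_cache_key(shop_id, order_cycle_id)
  "products-json-#{shop_id}-#{order_cycle_id}"
end

products_cache_key(4, 12)  # => "products-json-4-12"
```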
What do we cache?
JSON, ready to serve
Serving products
Rails.cache.fetch("products-json-#{shop.id}-#{order_cycle.id}") do
  # A cache miss on a live server means the refresh job hasn't run, so report it.
  if Rails.env.production? || Rails.env.staging?
    Bugsnag.notify(RuntimeError.new("Live server does not have product cache"))
  end

  # Generate the products JSON synchronously here:
  # - test/dev: this is the normal path on every page load
  # - production: this warms the cache the first time round
end
Things that trigger a cache refresh
after_save :refresh_cache
after_destroy :refresh_cache
ProductCache.refresh
# Perhaps there's a number of ways to specify what needs to be refreshed
General rule: Update open or upcoming order cycles, but not closed or undated
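The general rule can be sketched in plain Ruby, using a Struct rather than the real model. The attribute names opens_at/closes_at and the predicate refreshable? are assumptions for illustration:

```ruby
# Plain-Ruby sketch of the refresh rule: refresh open or upcoming order
# cycles, skip closed or undated ones. (Struct stands in for the model.)
OrderCycle = Struct.new(:opens_at, :closes_at)

def refreshable?(oc, now = Time.now)
  return false if oc.opens_at.nil? || oc.closes_at.nil?  # undated: skip
  oc.closes_at > now                                     # closed: skip
end

now = Time.utc(2014, 1, 15)
refreshable?(OrderCycle.new(Time.utc(2014, 1, 10), Time.utc(2014, 1, 20)), now)  # open     => true
refreshable?(OrderCycle.new(Time.utc(2014, 1, 16), Time.utc(2014, 1, 25)), now)  # upcoming => true
refreshable?(OrderCycle.new(Time.utc(2014, 1, 1),  Time.utc(2014, 1, 12)), now)  # closed   => false
refreshable?(OrderCycle.new(nil, nil), now)                                      # undated  => false
```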
Order placed
- This changes Variant#count_on_hand or VariantOverride#count_on_hand for !on_demand products
- See below for changes to these models
Changes to top-level models
Variant - Invalidate all distributions the variant appears in
Product - Invalidate all variants
VariantOverride - Invalidate all distributions for the VO’s distributor that this variant appears in
OrderCycle - Invalidate all distributors for this order cycle
EnterpriseFee - Invalidate each order cycle the enterprise fee appears as a coordinator fee in, and invalidate each distribution that it’s in an exchange for
Changes to dependencies
In these examples, CoordinatorFee → OrderCycle means that saving or destroying a CoordinatorFee will trigger #refresh_cache on the corresponding OrderCycle.
CoordinatorFee → OrderCycle
ExchangeFee → Exchange → OrderCycle
Property → ProductProperty → Product
Taxon → Classification → Product
Image → Product
Enterprise → Product (via supplied_products)
Variant (master) → Product
OptionType → OptionValue → Variant
Price → Variant
TaxCategory → EnterpriseFee
Preference → Calculator → EnterpriseFee
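The delegation pattern behind these chains can be sketched in plain Ruby. The class names mirror one chain above, but the wiring is illustrative only; in the real app it would run from after_save/after_destroy callbacks on ActiveRecord models:

```ruby
# Each dependent model forwards #refresh_cache to its parent, so a save
# anywhere in the chain bubbles up to the top-level model (here, the
# ExchangeFee → Exchange → OrderCycle chain).
class OrderCycle
  attr_reader :refreshes
  def initialize; @refreshes = 0; end
  def refresh_cache; @refreshes += 1; end
end

class Exchange
  def initialize(order_cycle); @order_cycle = order_cycle; end
  def refresh_cache; @order_cycle.refresh_cache; end
end

class ExchangeFee
  def initialize(exchange); @exchange = exchange; end
  # In Rails this would be invoked by after_save / after_destroy.
  def refresh_cache; @exchange.refresh_cache; end
end

oc  = OrderCycle.new
fee = ExchangeFee.new(Exchange.new(oc))
fee.refresh_cache
oc.refreshes  # => 1
```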
Catching missed triggers
It’s possible that we’ll miss a reason to trigger a cache refresh, which will result in stale data every time that trigger is encountered.
Additionally, it would be easy for a developer to add (for instance) a model and not add cache triggering for it.
To identify these problems in staging/production we could have a sanity check job, that runs every ~hour.
This job would generate a fresh version of the products JSON, and diff the pretty-printed version with the cached version.
Any differences would be reported through Bugsnag.
This job would run at lower priority than the cache refresh job, to avoid false failures due to pending cache updates.
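The comparison step of that job could look something like this. It is a standalone sketch: in the real job the cached string would come from Rails.cache and any mismatch would be reported via Bugsnag:

```ruby
require "json"

# Pretty-print both sides before comparing, so that when something is
# stale the diff we report is human-readable.
def stale?(cached_json, fresh_data)
  cached = JSON.pretty_generate(JSON.parse(cached_json))
  fresh  = JSON.pretty_generate(fresh_data)
  cached != fresh
end

stale?('{"products":[{"id":1}]}', "products" => [{ "id" => 1 }])  # => false
stale?('{"products":[]}',         "products" => [{ "id" => 1 }])  # => true
```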
Job queuing
At any one time, there can only be one running and one enqueued cache refresh job for any distribution.
If a further update happens, we ignore it, since the enqueued job will include it when it runs.
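A minimal in-memory sketch of that rule follows. The RefreshQueue class is an illustration of the dedup logic, not the Delayed Job API:

```ruby
# At most one running and one enqueued refresh job per distribution key.
# Further requests while a job is already enqueued are absorbed by it.
class RefreshQueue
  def initialize
    @running  = {}
    @enqueued = {}
  end

  # Returns true if a new job was enqueued, false if the request was
  # absorbed by an already-enqueued job.
  def request_refresh(key)
    return false if @enqueued[key]
    @enqueued[key] = true
    true
  end

  def start(key)
    @enqueued.delete(key)
    @running[key] = true
  end

  def finish(key)
    @running.delete(key)
  end
end

q = RefreshQueue.new
q.request_refresh("products-json-4-12")  # => true  (enqueued)
q.request_refresh("products-json-4-12")  # => false (already enqueued)
q.start("products-json-4-12")
q.request_refresh("products-json-4-12")  # => true  (previous job now running)
```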
Resource starvation
If a producer modifies a product that’s included in (say) 10 order cycles, then we’ll need to refresh the cache for each of those.
This would result in enqueuing 10 jobs at up to 30 seconds each, i.e. up to 5 minutes in total.
The cache refresh jobs will be given a lower priority than the default so that quicker, time-sensitive jobs (e.g. email sending) are not delayed by the full amount.
Even without the priority setting, complete resource starvation should not happen because all jobs sit in a queue, and newly enqueued refresh cache jobs will sit behind any other jobs that have been enqueued in the meantime.
Delayed job workers
The number of workers could be increased to clear a hypothetical queue more quickly.
For a change to a shared product,
total refresh time = cache refresh time x num OCs the product appears in
If this value crosses a threshold (perhaps 2 mins), then we can mitigate that by increasing the number of workers.
As long as the workers run on the same VPS as the Unicorn processes, it would make sense to keep the number of DJ workers to num_cores - 1 or fewer.
This way there will always be at least one core free to serve web requests.
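The sizing rule above, worked through with this document's example numbers (30 second refresh, 10 order cycles, ~2 minute threshold):

```ruby
cache_refresh_secs = 30
num_order_cycles   = 10

total_refresh_secs = cache_refresh_secs * num_order_cycles  # 300 secs = 5 minutes

threshold_secs = 120  # ~2 minutes
# Rough worker count to bring the total back under the threshold,
# assuming jobs parallelise evenly across workers.
workers_needed = (total_refresh_secs.to_f / threshold_secs).ceil  # => 3
```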
Effect on enterprise user admins
Admins may make changes and not see those changes reflected on the site for a few minutes.
We don’t need to expose information about the progress of updates, but we do need to set an expectation around these delays, especially for first-time admins.
Perhaps we could mention the delays in the flash messages after saving items.
We may need to increase the amount of time that the messages are visible to improve readability.
Monitoring delayed job
If delayed job fails, product information will become stale, but there’ll be no obvious alert that this has happened.
It would therefore be useful to report the status of delayed job on the admin dashboard, to super admins.
This report could include: whether the job runner(s) are running, a list of jobs in the queue, and a tail of the DJ logfile.
Additionally, we could raise explicit errors when a cache refresh job fails.
Bootstrapping
When a not-yet-open order cycle is saved, that will trigger a cache refresh, which will build the initial cache for that order cycle.
When this feature is first deployed, we’ll have no cache! When customers first hit the products page, the cache will be generated synchronously, and all subsequent requests will hit the cache.