Introduction
When there are a lot of products to display, the shopfront loads very slowly (~30 s for Stroudco with 386 products). Customers tend to leave when they get frustrated waiting for websites to load, so we’d really like to make it faster.
The products data changes quite infrequently, and stale data (on the order of a minute or so) is acceptable, since customers already experience this when they leave their browser window open for that period of time after a page load. Both of these characteristics make caching a promising approach.
Below are some software design notes on one approach to achieving this.
Refreshing the cache & stale data
The cache will be refreshed asynchronously in a delayed job.
The front end always serves from the cache.
Sometimes this will mean serving stale data, but this is not terrible -
it’s the same effect as if the customer had loaded the page before the change went through.
As long as the data is refreshed in roughly the same amount of time as they might stay on the page, we’re fine.
Experience adding out of stock items to cart
It would be helpful to make the “adding an out of stock product to cart” experience as nice as possible. Here’s a breakdown of what happens now and what the desired behaviour would be.
Add to cart
- Now: Request fails, and we retry every 3 seconds
- Desired: Alert the user, suggest a reload, and perform it asynchronously
Item already in cart, visit cart
- Now: Shows as out of stock
- Desired: No change required
Item already in cart, visit checkout
- Now: Returns to shop (CheckoutController#raise_insufficient_quantity)
- Desired: Return to cart
Item already in cart, perform checkout
- Now: Silent fail on “Place order now”. CheckoutController#raise_insufficient_quantity redirects to /shop, but the PUT there gives a 404 (no update action on ShopController)
- Desired: Redirect to cart; the JS should see this redirect and follow it as a full page reload
Development/test environments
In these environments, the products JSON is generated synchronously on each page load, without caching.
Cache keying
products-json-#{shop.id}-#{order_cycle.id}
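As a sketch, the key could be built by a small helper, with one cached blob per (shop, order cycle) distribution. The method name products_cache_key below is an assumption for illustration, not the real code:

```ruby
# Hypothetical helper: one cache key per (shop, order cycle) distribution.
def products_cache_key(shop_id, order_cycle_id)
  "products-json-#{shop_id}-#{order_cycle_id}"
end

products_cache_key(4, 12)  # => "products-json-4-12"
```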
What do we cache?
JSON, ready to serve
Serving products
Rails.cache.fetch("products-json-#{shop.id}-#{order_cycle.id}") do
  # A cache miss on a live server means the refresh job hasn't run, so report it.
  if Rails.env.production? || Rails.env.staging?
    Bugsnag.notify(RuntimeError.new("Live server does not have product cache"))
  end

  # Generate the products JSON synchronously here:
  # - test/dev: this is the normal path on every page load
  # - production: this warms the cache the first time round
end
Things that trigger a cache refresh
after_save :refresh_cache
after_destroy :refresh_cache
ProductCache.refresh
# Perhaps there's a number of ways to specify what needs to be refreshed
General rule: Update open or upcoming order cycles, but not closed or undated
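The general rule can be sketched in plain Ruby, using a Struct rather than the real model. The attribute names opens_at/closes_at and the predicate refreshable? are assumptions for illustration:

```ruby
# Plain-Ruby sketch of the refresh rule: refresh open or upcoming order
# cycles, skip closed or undated ones. (Struct stands in for the model.)
OrderCycle = Struct.new(:opens_at, :closes_at)

def refreshable?(oc, now = Time.now)
  return false if oc.opens_at.nil? || oc.closes_at.nil?  # undated: skip
  oc.closes_at > now                                     # closed: skip
end

now = Time.utc(2014, 1, 15)
refreshable?(OrderCycle.new(Time.utc(2014, 1, 10), Time.utc(2014, 1, 20)), now)  # open     => true
refreshable?(OrderCycle.new(Time.utc(2014, 1, 16), Time.utc(2014, 1, 25)), now)  # upcoming => true
refreshable?(OrderCycle.new(Time.utc(2014, 1, 1),  Time.utc(2014, 1, 12)), now)  # closed   => false
refreshable?(OrderCycle.new(nil, nil), now)                                      # undated  => false
```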
Order placed
- This changes Variant#count_on_hand or VariantOverride#count_on_hand for !on_demand products
- See below for changes to these models
Changes to top-level models
Variant - Invalidate all distributions the variant appears in
Product - Invalidate all variants
VariantOverride - Invalidate all distributions for the VO’s distributor that this variant appears in
OrderCycle - Invalidate all distributors for this order cycle
EnterpriseFee - Invalidate each order cycle the enterprise fee appears as a coordinator fee in, and invalidate each distribution that it’s in an exchange for
Changes to dependencies
In these examples, CoordinatorFee → OrderCycle means that saving or destroying a CoordinatorFee will trigger #refresh_cache on the corresponding OrderCycle.
CoordinatorFee → OrderCycle
ExchangeFee → Exchange → OrderCycle
Property → ProductProperty → Product
Taxon → Classification → Product
Image → Product
Enterprise → Product (via supplied_products)
Variant (master) → Product
OptionType → OptionValue → Variant
Price → Variant
TaxCategory → EnterpriseFee
Preference → Calculator → EnterpriseFee
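The delegation pattern behind these chains can be sketched in plain Ruby. The class names mirror one chain above, but the wiring is illustrative only; in the real app it would run from after_save/after_destroy callbacks on ActiveRecord models:

```ruby
# Each dependent model forwards #refresh_cache to its parent, so a save
# anywhere in the chain bubbles up to the top-level model (here, the
# ExchangeFee → Exchange → OrderCycle chain).
class OrderCycle
  attr_reader :refreshes
  def initialize; @refreshes = 0; end
  def refresh_cache; @refreshes += 1; end
end

class Exchange
  def initialize(order_cycle); @order_cycle = order_cycle; end
  def refresh_cache; @order_cycle.refresh_cache; end
end

class ExchangeFee
  def initialize(exchange); @exchange = exchange; end
  # In Rails this would be invoked by after_save / after_destroy.
  def refresh_cache; @exchange.refresh_cache; end
end

oc  = OrderCycle.new
fee = ExchangeFee.new(Exchange.new(oc))
fee.refresh_cache
oc.refreshes  # => 1
```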
Catching missed triggers
It’s possible that we’ll miss a reason to trigger a cache refresh, which will result in stale data every time that trigger is encountered.
Additionally, it would be easy for a developer to add (for instance) a model and not add cache triggering for it.
To identify these problems in staging/production we could have a sanity check job, that runs every ~hour.
This job would generate a fresh version of the products JSON, and diff the pretty-printed version with the cached version.
Any differences would be reported through Bugsnag.
This job would run at lower priority than the cache refresh job, to avoid false failures due to pending cache updates.
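The comparison step of that job could look something like this. It is a standalone sketch: in the real job the cached string would come from Rails.cache and any mismatch would be reported via Bugsnag:

```ruby
require "json"

# Pretty-print both sides before comparing, so that when something is
# stale the diff we report is human-readable.
def stale?(cached_json, fresh_data)
  cached = JSON.pretty_generate(JSON.parse(cached_json))
  fresh  = JSON.pretty_generate(fresh_data)
  cached != fresh
end

stale?('{"products":[{"id":1}]}', "products" => [{ "id" => 1 }])  # => false
stale?('{"products":[]}',         "products" => [{ "id" => 1 }])  # => true
```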
Job queuing
At any one time, there can only be one running and one enqueued cache refresh job for any distribution.
If a further update happens, we ignore it, since the enqueued job will include it when it runs.
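A minimal in-memory sketch of that rule follows. The RefreshQueue class is an illustration of the dedup logic, not the Delayed Job API:

```ruby
# At most one running and one enqueued refresh job per distribution key.
# Further requests while a job is already enqueued are absorbed by it.
class RefreshQueue
  def initialize
    @running  = {}
    @enqueued = {}
  end

  # Returns true if a new job was enqueued, false if the request was
  # absorbed by an already-enqueued job.
  def request_refresh(key)
    return false if @enqueued[key]
    @enqueued[key] = true
    true
  end

  def start(key)
    @enqueued.delete(key)
    @running[key] = true
  end

  def finish(key)
    @running.delete(key)
  end
end

q = RefreshQueue.new
q.request_refresh("products-json-4-12")  # => true  (enqueued)
q.request_refresh("products-json-4-12")  # => false (already enqueued)
q.start("products-json-4-12")
q.request_refresh("products-json-4-12")  # => true  (previous job now running)
```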
Resource starvation
If a producer modifies a product that’s included in (say) 10 order cycles, then we’ll need to refresh the cache for each of those.
This would result in enqueuing 10 jobs at up to 30 seconds each, i.e. up to 5 minutes in total.
The cache refresh jobs will be given a lower priority than the default so that quicker, time-sensitive jobs (e.g. email sending) are not delayed by the full amount.
Even without the priority setting, complete resource starvation should not happen because all jobs sit in a queue, and newly enqueued refresh cache jobs will sit behind any other jobs that have been enqueued in the meantime.
Delayed job workers
The number of workers could be increased to clear a hypothetical queue more quickly.
For a change to a shared product,
total refresh time = cache refresh time x num OCs the product appears in
If this value crosses a threshold (perhaps 2 mins), then we can mitigate that by increasing the number of workers.
As long as the workers run on the same VPS as the Unicorn processes, it would make sense to keep the number of DJ workers to num_cores - 1 or fewer.
This way there will always be at least one core free to serve web requests.
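The sizing rule above, worked through with this document's example numbers (30 second refresh, 10 order cycles, ~2 minute threshold):

```ruby
cache_refresh_secs = 30
num_order_cycles   = 10

total_refresh_secs = cache_refresh_secs * num_order_cycles  # 300 secs = 5 minutes

threshold_secs = 120  # ~2 minutes
# Rough worker count to bring the total back under the threshold,
# assuming jobs parallelise evenly across workers.
workers_needed = (total_refresh_secs.to_f / threshold_secs).ceil  # => 3
```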
Effect on enterprise user admins
Admins may make changes and not see those changes reflected on the site for a few minutes.
We don’t need to expose information about the progress of updates, but we do need to set an expectation around these delays, especially for first-time admins.
Perhaps we could mention the delays in the flash messages after saving items.
We may need to increase the amount of time that the messages are visible to improve readability.
Monitoring delayed job
If delayed job fails, product information will become stale, but there’ll be no obvious alert that this has happened.
It would therefore be useful to report the status of delayed job on the admin dashboard, to super admins.
This report could include: whether the job runner(s) are running, a list of jobs in the queue, and a tail of the DJ logfile.
Additionally, we could raise explicit errors when a cache refresh job fails.
Bootstrapping
When a not-yet-open order cycle is saved, that will trigger a cache refresh, which will build the initial cache for that order cycle.
When this feature is first deployed, we’ll have no cache! When customers first hit the products page, the cache will be generated synchronously, and all subsequent requests will hit the cache.