Reports Project

We had some initial discussions about tech strategy in terms of implementation. Some of the key talking points were:

Overlap of use-cases

In the three identified use-case categories there is a lot of potential overlap. We are possibly looking at multiple solutions for these three cases instead of one single solution.

Using Metabase for all aspects of reporting

Summary:

Proposal to use Metabase for all reporting needs.

Pros:

  • we wouldn’t have to rewrite / maintain in-app reports

Cons:

  • would require some way of re-implementing OFN permissions logic within Metabase, which could be very complex
  • depends on the Kafka-based datastreaming setup which is currently in beta
  • might not be possible to cover all reports requirements outside of OFN business logic

Shifting reports into the API

Summary:

Shift the reports output to API endpoints and have the in-app reports consume those endpoints when generating reports.

Pros:

  • could improve in-app reports and increase available data-extraction options at the same time

Cons:

  • potential increased resource use?

Pointing reports queries at replicas

Summary:

Generating reports is resource intensive and currently hits the production database. We could point the reports queries at (read-only) database replicas to improve this.

Note: this option is not mutually exclusive with other proposals, it’s something we can consider alongside any other options.

Pros:

  • reduce load on production databases, improve timeouts

Cons:

  • depends on a replication system, either with Kafka-based datastreaming or another option
1 Like

Reports Spike: “Shifting reports into the API”

I did a little spike on this option to see what it would be like. I picked a report at random and made a quick POC roughly as follows:

  • made a new API endpoint for the report
  • fetched the relevant data
  • passed it to a new serializer, with an attribute for each column in the report
  • made a quick Angular service that queried the endpoint and displayed the results in a table
  • did some before and after performance comparison, loading the same dataset with both the old and new methods

Good Results

  • I found a really nice performance improvement by accident that affects multiple reports, submitted a quick PR
  • It seemed like the code could be made much cleaner and simpler, and the controllers a lot slimmer :fire: :fire:
  • Running it through Angular would potentially open up more UX options…?

Bad Results

  • The performance was significantly worse, with total page load time around 25% higher. This is probably due to the added rendering layer; essentially rendering the data as serialized JSON then rendering the JSON as HTML added too much overhead.

Conclusion

This probably isn’t a good idea. The spike was useful though.

2 Likes

Reports Code Analysis

I did some general analysis of the existing reports code, looking at where the issues and complexities are and what would be required in a refactor/rewrite.

General Issues

  • the classes that define each report are scattered all over the codebase (it’s incredible how many different locations they reside in)
  • in some cases the structure and methods used are quite different from one report to the next, there are various approaches and styles
  • a lot of the code is obtuse and difficult to follow
  • the reports are quite inflexible, so for example if we wanted to add a nice UX feature like the ability to choose which columns were exported (show/hide) when downloading the report as a spreadsheet, it would be a nightmare
  • the code around permissions needs improving. It’s not reports-specific, it applies to the whole app

Grouping and Summarising

The primary point of code complexity is in the places where we apply rules for grouping results and adding “summary rows” for those groupings. So for example, a report might list line_items grouped by order, and after each set we inject a row that shows some summed/aggregated values for selected columns, like quantity total, before showing the next set, eg:

---------------------------------
| order   | item     | quantity |
---------------------------------
| order1  | carrots  | 5        |
| order1  | broccoli | 3        |
---------------------------------
| TOTAL   |          | 8        |
---------------------------------
| order2  | carrots  | 2        |
| order2  | parsnips | 1        |
---------------------------------
| TOTAL   |          | 3        |
---------------------------------

This “grouping results and injecting summary rows” is a key part of the report-generating process. If we can find a way to do it really nicely and cleanly, we can make everything a lot prettier.

Masking Customer Data

We have some really nasty code for conditionally hiding certain customer info in a few reports based on permissions. Summary of the current state:

  • The code for hiding customer data is really awful in terms of performance and gets run in every single report that queries line items (most reports).
  • There are only actually 5 reports out of 27 where customer data even potentially needs to be hidden.
  • Of these 5 reports, in some cases there is only a single field hidden. In other cases there are also other fields present in the report that should probably not be shown to these users with limited permissions either, but they are shown.
  • There are two separate versions of this same code for hiding the customer data, kept in two different places (see Reports::LineItems and OrderAndDistributorReport).

As with the “grouping and summarising”, if we handled this bit nicely we could could clean everything up and radically improve performance at the same time.

Broad Objectives

  • Report controllers should be slim and clean
  • Report classes where we define what data should be fetched and how the data should be grouped and aggregated (including “summary rows”) should be: slim, easy to read, easy to change and easy to create
  • Once a report object is created, it should contain a nice representation of the resulting dataset which is easy to manipulate and easy to render/convert to a broad range of formats
  • We should be able to render that object directly to (for example): a HTML table, a CSV spreadsheet, an Excel spreadsheet, an OpenOffice spreadhseet, JSON, etc
  • For UX purposes the flexibility of the report object is really important
2 Likes

Nice work Matt.
I think that “Refactoring the existing reports solution” is a 4th option (not option 1 metabase, not option 2 API based, and maybe not even option 3 connect them to a replica) and deserves it’s pros and cons.

Refactoring the existing reports solution

Described in details above by Matt.

Pros
  • Flexibility to build the exact reports we want, specially in terms of UI (I am not sure how flexible metabase is in terms of exporting data for example)
  • Re-use existing biz logic already implemented for existing reports (:warning: performance challenge)
Cons
  • Highest maintenance cost of the options available with no re-use of API code and no re-use of UI features of metabase

Other options

Regarding the other options, I can add a few more pros and cons to the 3 options:

2 more pros for using Metabase for all aspects of reporting:

  • NEW we would rely on a standard open source reporting tool getting their regular features and updates
  • NEW we would get all the metabase features like the visualizations to the enterprise users

2 more pro for Shifting reports into the API

  • NEW API would be the single point for implementing and simplifying the logic behind reports data
    and more 1 con:
  • NEW We would need to get hyper fast json rendering which would be a good thing for the API as a whole

1 more pro for Pointing reports queries at replicas

  • NEW decrease risk of reports crash the app with it’s load

This comment by Matt made me wonder. If there is a strong demand for UX tweaks of reports by users then is it a viable option to keep the system component that generates the report data quite ‘dumb’ and then pass some of the complexity of report customisation on to the client?

I’m thinking that examples like this grouping of summary rows could be quite performant to do at the client, such as on screen report viewing (just the way you like it) or downloading as PDF/CSV, manipulating in Metabase.

If so this makes me lean towards the OFN API. It’s job is to do the basics, provide (Extract) data that pretty much maps to the OFN data models. Other tools/clients then transform/load that data as needed, and any performance issues with transformation/load are the reposnsibilty of the report client.

I recalled agreement that one of the ‘exploration’ tasks was to define the useful ‘mega-report’ and investigate building from scratch one nice new clean perfect in-app report, that provided us with (e.g.) line items with correct taxes and fees . .

just re @Matt-Yorkley’s in-app investigation above and @luisramos0 comment about in-app report refactoring being a 4th option - it looks to me like it is a strategy investigation under option 1. My notes had it structured like below. I hadn’t put them in because they were the same as Matt’s Use-Case categories, but now I think this structure might also be useful

1. “IN-APP” REPORTS

Strategy investigation:

  • a) new report/s that are built beautifully and give us the key data we need - @lin_d_hop to define / suggest?
  • b) do from database replica rather than main database
  • c) gems
  • d) ‘re-factor’

2. DATA EXTRACTION

  • a) API
  • b) other formats - odt etc
  • c) ?

3. Business Analytics

  • a) Metabase - Kirsten exploring, others exploring - Q. spend half day to get Aus data set up so that we can closely investigate how easy it is for us to build the reports our customers want (much easier with familiar data). ping @chez
  • b) replicas - Matt building roadmap for getting effective replica set up and hooked in (to be considered at next product curation?)

Reports Spike: Available Report-Building Gems

I did a quick spike on Ruby gems that could possibly help with the issues outlined in the Reports Code Analysis above, including things like:

  • handling some of the more complex aggregating/summarising functionality more cleanly
  • allowing for more readable and concise report definitions
  • allowing for easier manipulation of the report that might be useful for UX functionality, etc.

Conclusion

There were a few interesting frameworks/solutions, but they were generally not very lightweight, a bit over-engineered, bad on performance (I did some quick experimenting) or provided nice functionality but in a very limited way.

So it doesn’t look like there’s a gem that will adequately cover our reports needs.

I made a quick POC on what a reports rewrite might look like, based on both the Reports Code Analysis and some inspiration from features I saw when looking at report-generating gems (although none of the gems ultimately seemed like a viable/usable solution). The POC looks like this:

This was more of a “glass-of-scotch-and-some-leisurely-hacking” kind of thing at the weekend, but if we want to go in this direction I could do some actual work on it.

TLDR; it works really well, covers all the objectives in the Reports Code Analysis, and I think it would enable all the requirements outlined in Harriet’s analysis to be completed really easily.

Just a quick opinion on the “do everything in Metabase” option:

  • I think it would be unfeasible to implement all of the required permissions and other logic we would need inside of Metabase in order to cover all the reports requirements
  • If we went in that direction I think it would actually be incredibly expensive and we would not have anything that was production-ready for a long, long time…

I do think Metabase is awesome for certain things and certain cases, and we should explore it more, but I don’t think it’s a viable option for reports as a whole. If you want to see things like quarterly sales data visualised in a beautiful graph plotted against last year’s data (visualisations of broad queries on raw data), Metabase is incredible. If you want an Excel packing sheet for an order cycle, with per-user and per-enterprise permissions and various bits of app-specific business logic applied, Metabase is not the right tool.

Based on everything that has come up so far we seem confident that we have 3 ways to access OFN data that we wish to support:

  • In App
  • Business Intelligence
  • API

From the great tech spike work that @Matt-Yorkley we have lots of valuable insights. In particular it seems clear that we need to follow all three routes in order to satisfy our report requirements and that no one of these will be the one that rules reports totally.

So in order to create a Roadmap for Reports we need to understand the priorities for each of the routes.

In App

Looking at the summary of requirements it seems that we have a few ways forward for In App reports:

  1. Focus on refactoring existing reports one by one, improving consistency and fixing bugs as we go.
  2. Start the refactor with a new report to solve the pressing enterprise fees and tax issues. This will serve as a trial for the reports refactor and we can bring other reports in as we go.
  3. Start the refactor by enabling a ‘mega-report’ that solves most reports needs and forms the basis of the reports refactor by rendering other reports obsolete. The mega-report could replace most reports if it enabled:
  • Querying orders with and without line items
  • Includes all the key search fields from reports - order cycles, distributor, shipping/payment methods, date range
  • Sort by Distributor, Producer and Shopper
  • Totals and aggregation
  • Hide and display a million potential fields

It is likely that an approach of developing a new report will be lower risk as we can leave the existing reports alone while we do it and experiment with a new reports approach.

EDIT - Note that In App reports can be available via the API in the refactored form that Matt has identified with no extra work.

EDIT - To be clear the priority reports problem to solve is Tax… though the questions remain about the scope and product strategy of this report.

So questions:

  • If we go for a new report what is the scope of it?
  • What are our priorities and scope for the first iteration of updates for In App reports?
  • Is there a design task to bring consistency to report field conventions and naming that should be done before a new report?
  • Or would it be better to design a new report and then clean up the old reports afterwards?

Business Intelligence

It is clear that we have a number of large users that need bespoke reports to solve their requirements. We’ll never be able to satisfy all our users specific needs when it comes to reports and BI so using an external system built for this is a great solution. Go Metabase!

With not much work we can set up metabase for super admin access on the same DB that the app uses. This is fine for some cases can gives superadmin more of the power they need but it does not solve all the problems.

Ideally we want users to be able to access all the power themselves. But this means recreating the OFN permissions logic in Metabase and that logic is complex. There are infrastructure challenges inherent as well - see below.

So the questions here:

  • Where does the Metabase roll out sit in the reports priorities?
  • Does it make sense to create super admin only Metabase for each instance as they need it? Eg France, Aus, Katuma all have their own Metabase on their app DB.
  • Would it be safer to create duplicate DBs for all instances using Metabase?
  • Are there bigger gains if we instead focus on aggregating duplicate databases so we don’t have to duplicate work in Metabase?
  • Or is it a higher priority to get the OFN permissions system into Metabase to enable users to access?
  • And how does this sit alongside the other reports priorities in our Roadmap?

API

From the investigations is have become clear that just ‘putting reports on the API’ posed problems with performance, both in terms of our json rendering (its slow) and in the complexity of aggregation. (EDIT - In reports refactoring Matt has shown that it is technically possible to put all reports on the API efficiently). If we need to make an end point for every report then the actual data gains found might not be all that worthwhile.

The primary data extraction needs via API are Orders with and without line items. The API already has endpoint for these, imperfectly. The great work done here suggests that almost all the search parameters we could dream of will filter correctly. However do have a bit of a mess in the responses - in particular adjustments, fees and taxes are a mess. We know that we need Bye Bye Spree before we can refactor the adjustments code.

There is much to be done to understand the data structure that we want for our json outputs. Fortunately the API is already out of Spree (hooray!) so we can being this work as we wish.

Questions:

  • Where does the API work sit in our priorities?
  • Is there a design process that should be starting earlier to understand our real needs from the API?

Infrastructure

As data extraction and reports put a high load on servers there is much talk of replica databases. One proposal has been to have all instances join together their data on a single replica DB. This offers huge potential advantages when it comes to integrations, allowing all instances to benefit from Zapier integration, Metabase reports and other API integrations without reinventing or needing their own tech team.

So, questions:

  • So from a product perspective is this a direction we want to go?
  • Do we want to prioritise this before other work?
  • What else do we need to know from a technical perspective that will help to position this work in the priorities?

Shout out to @Matt-Yorkley for the absolutely awesome technical investigations that have helped us understand where to go with reports!

Seems to me that we are now at a place in which we want to create a Product Roadmap for the reports work so that we can understand the priorities within and between the three streams and start to plan the work.

This is all intended to be food for thought for the reports meeting on Tuesday 23rd June. See y’all there.

3 Likes

Just to clarify a bit on the API point; the initial idea was moving reports into the API, which would mean any in-app reports (on screen or exported as spreadsheets) would have to consume that API before doing anything. This ended up being incredibly inefficient, as it went through a process of: fetch the data -> run it through serializers -> render it as JSON -> run it through Angular -> render the data a second time as HTML. The process added about 25% to the total page load, and it would have been similar for exporting to other formats.

The new proposal I’ve made would would have the same benefits of that initial API option (all reports would be accessible from the API, for no extra work) but the code would live in a slightly different place (not primarily in the API) and the result is that generating any in-app reports (on screen or as spreadsheet files) would be twice as fast instead of +25% slower.

1 Like

Thanks for this great update @lin_d_hop :slight_smile:

I thought it was reports on tax the first priority on in app reports? Did I missunderstood?

1 Like

Yes tax came out on top… but are we just doing a ‘tax report’? Or are we looking toward a ‘mega-report’? Or refactorring the existing tax report? What is in and out of scope in a ‘tax report’?
But you are right - tax came out the top priority and I didn’t make that clear.

This is so so good. Awesome work, excited about meeting next week :slight_smile:

Great work!

About infrastructure: Having one central server with all data is a security risk. Technically it would be very efficient but I find that quite scary.

Interesting. Why is this scary @maikel? Because a hacker could get international shopper data? Addresses etc.

Yes, one server containing data of all instances is much more attractive than one server containing a fraction of that data. From the top of my head, data prone to misuse are all the keys for Stripe, Paypal, emailing etc, then all the personal information to impersonate people or send very targeted phishing emails. Having ten times more data means ten times the impact if it gets hacked.

wow @lin_d_hop that was amazing! :clap: I can’t add much that you didn’t mention. I share exaclty the same vision. The solution to our needs involves these three pieces.

I’m not clear if that’s what you meant @Matt-Yorkley but I’d like to stress that whether or not we render reports server-side or client-side should be a smaller detail. If we go server-side I’d like to share 90% of the code that feeds the API. If we go client-side instead, chances are that we’ll need purpose-specific endpoints that return that data in the format that best fits the UI.

Looking forward to the meeting :heart_eyes:

Take a look at the POC! It’s fully working (currently for one report) and can be accessed from the API or from the reports section in various formats. The code is 100% shared between the API endpoint and the in-app rendering methods (which are twice as fast as they were before): https://github.com/openfoodfoundation/openfoodnetwork/pull/5576

2 Likes

Product vision meeting notes 23/06

Thanks everyone for attending the product vision session which helped us start the priorities and scope to help define roadmap of the 3 routes (In App, BI, API)

Please find a summary below

Final high level summary of prioritisation

  1. Tax report - In App
  2. BI - same servers, different databases
  3. In App (excluding tax report) more analysis is needed
  4. API more input needed from DFC

Tax report

Scope of tax report:

  • Agreed would be in app
  • Orders and order with line items
  • Aggregations: Customer totals, supplier totals, Fee/Tax/adjustment totals, product total, variant totals.
  • Columns: Split values into tax itemisations etc
  • Sum by order for suppliers
    • Order by suppliers (by line item?)
    • product strategy for tax report

Tax report notes

BI (including Product infrastructure)

Options discussed :

  • Super admin only for each instance separately
  • Replica DBs (for each instance)
  • Replica DB that aggregates instances
  • Replicate OFN Enterprise permissions for direct user access

BI notes

  • Decision: Same servers, different databases which allows instances to get data they want (change password)
  • Link to previous roadmap
  • Discuss this in product curation on 30/06
  • Action: Wish list/epic created for these
    • there’s some timestamp parsing issues currently that should be fixed when we move to Rails 4, so maybe we should wait for the v3 rollout
    • the first step is provisioning a new Kafka/Metabase server on a bigger droplet. This is already automated, so it’s a small job, but it needs to be done first
    • the proposed roadmap is now to expand the replication and Metabase stuff to include France, Aus, and Canada

In App

Options discussed:

  • Refactor all existing reports and resolve consistency issues
  • Mega report with the flexibility to show/hide fields for orders with/without line items

Considerations:

  • Is a mega report realistic? Is it possible to maintain something so flexible?
  • Do we need to keep the existing reports? Can we group, cluster and remove superfluous?
  • Performance and resource issues will be key

In App notes

  • Get ‘simple’ requirements from tax report provided first, review and have a follow up

API

Options discussed:

  • Just refactoring existing reports as per Matts POC
  • Redesigning the json outputs for orders w/out line items

API notes

  • Wait for DFC insight

Summary of Actions and next steps

In App Actions

BI Actions

  • Lynne to create wish list/epic for BI Server (FR, AU, CA)
  • Discuss in product curation on 30/06

HH to create first draft of project roadmap

HH to create draft wider product vision for product team input and review