Reports Project

Matt-Yorkley · June 1, 2020, 8:56am

Hey Devs!

We’ve got a scheduled meeting tomorrow to discuss the overarching tech strategy for Reports. Basically we want to come to a clear vision of what a new and improved reports system would look like in terms of the technical implementation.

Some broad objectives to consider:

Clearer reports code that’s easier to maintain
Better separation of loading data and rendering data
Greater flexibility in adding new reports or customising existing reports

Some relevant reports features being discussed:

Reports that can be customised by the user (adding/removing fields etc to generate the report as they want it)
Improved UX around per-user marking of “favourite” reports or per-user preferred default settings
Reports data can be pulled from an API endpoint for use in external tools
Reports can be exported in a wider range of formats
Better test coverage of reports

Some implementation ideas we’ve been thinking about:

Doing a spike on available reporting gems that we could lean on to make the code simpler / easier
Moving reports towards the API. Fetching and arranging reports data could be done on the API side, serialized cleanly, then rendered separately in whatever format is needed

luisramos0 · June 1, 2020, 12:09pm

Interesting!!!

Metabase

I see this as a BI challenge, that’s why I can only think of Metabase, we really need to leverage it’s power now that we have the setup basically done. This is something inevitable, we just need to decide how much we invest in it vs investing in a custom reports solution like the one we have.

The main decision is: what users will be using metabase: instance managers and admins will obviously be using it. Will enterprise managers have access to metabase?

One of the major challenges with reporting (and metabase is not different) is data level authorization (show users only the data they have permissions to see). This is already for us the main challenge for performance optimization in the existing reports (we did some good progress recently).

In metabase, we can embed this authorization layer in the SQL (like we do in the current reports) BUT I have seen many different approaches to this problem.

This is just an example but I can think of one case where it was done in the ETL/data streaming layer with incredible success and simplicity. Basically, instead of having one DB from where metabase reads, we can create enterprise specific DBs and make kafka write live data to these databases according to what enterprise the data refers to (same data can be written to different DBs if that’s something needed). This can sound a bit hard to do initially but the power and flexibility we get from it is amazing: with these enterprise specific DBs we would have no data authorization to handle and we could use metabase to build whatever reports/dashboards the users need. We would need to think about a way to simplify the authorization models so that this is feasible. We will need to do this simplification anyway otherwise reports will always be an authorization nightmare (that class we have OrderCyclePermissions is just nuts).
Well, this idea is not yet very mature in my head but I think it definitely deserves to be considered. I dont think these enterprise specific BI DBs is the only way to go, we can find alternative approaches.
I have seen a lot of reporting and BI tools in action and I know how powerful for the business it can be to have something like metabase available to the managers.

Custom reports

I am strongly tending towards solving this problem with metabase/BI and not with more troublesome ruby code that renders specific reports and always needs a dev to fix problems or improve things. But if we decide metabase is only for admins, we need such a solution.

I am not sure we need/should rely on the API for these reports, the datasets needed for these reports will be specific and large, I am not sure what’s the advantage of having reports as part of the API.
I am not sure but I think it’s better to separate reporting and API requirements and solutions. The reports can be exported to json but I see reports as particular representations of aggregated data while the API has a standard representation of basic entities (products, orders, enterprises, etc). In that respect, external systems do not need reports to integrate with OFN, they need good API endpoints with all the data they require.

@Matt-Yorkley I am not sure about the gems approach but I think that’s because I dont understand what the gems could do for us. I think the gems will differ considerably in approach? Maybe a spike would clarify that

Matt-Yorkley · June 5, 2020, 1:21pm

Ok, so we had a good discussion the other day with a lot to think about. I’m not sure we’re arriving at a concrete plan yet, but we can discuss the outline a bit more. I’ll add the three main use-case categories here, which I think is a good starting place. I’ll summarise without the pros and cons for now.

Use-Case Categories

“IN-APP” REPORTS

This includes things like packing sheets, tax reports, products sold per order cycle, etc. The reports should be able to be rendered as HTML in the browser, as well as exported in multiple formats like plain CSV, ODT (OpenOffice), XLS and XLSX (Microsoft), etc.

DATA EXTRACTION

Improvements to the API could bring a lot more oportunities for improving integrations of all kinds, for example; automating user billing with Zapier, exporting enterprise sales data to external accounting software, etc.

BUSINESS INTELLIGENCE

This would include custom reports and data visualisations, and could be used internally by instance managers as well as by enterprise users. In terms of BI as a product, this might be something aimed at a small subsection of enterprises that have larger turnover and more complex business models, provided as a paid service.

tschumilas · June 7, 2020, 1:19am

OMG - it sounds like heaven!

Matt-Yorkley · June 7, 2020, 8:36am

We had some initial discussions about tech strategy in terms of implementation. Some of the key talking points were:

Overlap of use-cases

In the three identified use-case categories there is a lot of potential overlap. We are possibly looking at multiple solutions for these three cases instead of one single solution.

Using Metabase for all aspects of reporting

Summary:

Proposal to use Metabase for all reporting needs.

Pros:

we wouldn’t have to rewrite / maintain in-app reports

Cons:

would require some way of re-implementing OFN permissions logic within Metabase, which could be very complex
depends on the Kafka-based datastreaming setup which is currently in beta
might not be possible to cover all reports requirements outside of OFN business logic

Shifting reports into the API

Summary:

Shift the reports output to API endpoints and have the in-app reports consume those endpoints when generating reports.

Pros:

could improve in-app reports and increase available data-extraction options at the same time

Cons:

potential increased resource use?

Pointing reports queries at replicas

Summary:

Generating reports is resource intensive and currently hits the production database. We could point the reports queries at (read-only) database replicas to improve this.

Note: this option is not mutually exclusive with other proposals, it’s something we can consider alongside any other options.

Pros:

reduce load on production databases, improve timeouts

Cons:

depends on a replication system, either with Kafka-based datastreaming or another option

Matt-Yorkley · June 7, 2020, 2:16pm

Reports Spike: “Shifting reports into the API”

I did a little spike on this option to see what it would be like. I picked a report at random and made a quick POC roughly as follows:

made a new API endpoint for the report
fetched the relevant data
passed it to a new serializer, with an attribute for each column in the report
made a quick Angular service that queried the endpoint and displayed the results in a table
did some before and after performance comparison, loading the same dataset with both the old and new methods

Good Results

I found a really nice performance improvement by accident that affects multiple reports, submitted a quick PR
It seemed like the code could be made much cleaner and simpler, and the controllers a lot slimmer
Running it through Angular would potentially open up more UX options…?

Bad Results

The performance was significantly worse, with total page load time around 25% higher. This is probably due to the added rendering layer; essentially rendering the data as serialized JSON then rendering the JSON as HTML added too much overhead.

Conclusion

This probably isn’t a good idea. The spike was useful though.

Matt-Yorkley · June 7, 2020, 3:59pm

Reports Code Analysis

I did some general analysis of the existing reports code, looking at where the issues and complexities are and what would be required in a refactor/rewrite.

General Issues

the classes that define each report are scattered all over the codebase (it’s incredible how many different locations they reside in)
in some cases the structure and methods used are quite different from one report to the next, there are various approaches and styles
a lot of the code is obtuse and difficult to follow
the reports are quite inflexible, so for example if we wanted to add a nice UX feature like the ability to choose which columns were exported (show/hide) when downloading the report as a spreadsheet, it would be a nightmare
the code around permissions needs improving. It’s not reports-specific, it applies to the whole app

Grouping and Summarising

The primary point of code complexity is in the places where we apply rules for grouping results and adding “summary rows” for those groupings. So for example, a report might list line_items grouped by order, and after each set we inject a row that shows some summed/aggregated values for selected columns, like quantity total, before showing the next set, eg:

---------------------------------
| order   | item     | quantity |
---------------------------------
| order1  | carrots  | 5        |
| order1  | broccoli | 3        |
---------------------------------
| TOTAL   |          | 8        |
---------------------------------
| order2  | carrots  | 2        |
| order2  | parsnips | 1        |
---------------------------------
| TOTAL   |          | 3        |
---------------------------------

This “grouping results and injecting summary rows” is a key part of the report-generating process. If we can find a way to do it really nicely and cleanly, we can make everything a lot prettier.

Masking Customer Data

We have some really nasty code for conditionally hiding certain customer info in a few reports based on permissions. Summary of the current state:

The code for hiding customer data is really awful in terms of performance and gets run in every single report that queries line items (most reports).
There are only actually 5 reports out of 27 where customer data even potentially needs to be hidden.
Of these 5 reports, in some cases there is only a single field hidden. In other cases there are also other fields present in the report that should probably not be shown to these users with limited permissions either, but they are shown.
There are two separate versions of this same code for hiding the customer data, kept in two different places (see Reports::LineItems and OrderAndDistributorReport).

As with the “grouping and summarising”, if we handled this bit nicely we could could clean everything up and radically improve performance at the same time.

Broad Objectives

Report controllers should be slim and clean
Report classes where we define what data should be fetched and how the data should be grouped and aggregated (including “summary rows”) should be: slim, easy to read, easy to change and easy to create
Once a report object is created, it should contain a nice representation of the resulting dataset which is easy to manipulate and easy to render/convert to a broad range of formats
We should be able to render that object directly to (for example): a HTML table, a CSV spreadsheet, an Excel spreadsheet, an OpenOffice spreadhseet, JSON, etc
For UX purposes the flexibility of the report object is really important

luisramos0 · June 8, 2020, 5:55pm

Nice work Matt.
I think that “Refactoring the existing reports solution” is a 4th option (not option 1 metabase, not option 2 API based, and maybe not even option 3 connect them to a replica) and deserves it’s pros and cons.

Refactoring the existing reports solution

Described in details above by Matt.

Pros

Flexibility to build the exact reports we want, specially in terms of UI (I am not sure how flexible metabase is in terms of exporting data for example)
Re-use existing biz logic already implemented for existing reports ( performance challenge)

Cons

Highest maintenance cost of the options available with no re-use of API code and no re-use of UI features of metabase

Other options

Regarding the other options, I can add a few more pros and cons to the 3 options:

2 more pros for using Metabase for all aspects of reporting:

NEW we would rely on a standard open source reporting tool getting their regular features and updates
NEW we would get all the metabase features like the visualizations to the enterprise users

2 more pro for Shifting reports into the API

NEW API would be the single point for implementing and simplifying the logic behind reports data
and more 1 con:
NEW We would need to get hyper fast json rendering which would be a good thing for the API as a whole

1 more pro for Pointing reports queries at replicas

NEW decrease risk of reports crash the app with it’s load

mbudm · June 9, 2020, 1:17am

This comment by Matt made me wonder. If there is a strong demand for UX tweaks of reports by users then is it a viable option to keep the system component that generates the report data quite ‘dumb’ and then pass some of the complexity of report customisation on to the client?

I’m thinking that examples like this grouping of summary rows could be quite performant to do at the client, such as on screen report viewing (just the way you like it) or downloading as PDF/CSV, manipulating in Metabase.

If so this makes me lean towards the OFN API. It’s job is to do the basics, provide (Extract) data that pretty much maps to the OFN data models. Other tools/clients then transform/load that data as needed, and any performance issues with transformation/load are the reposnsibilty of the report client.

Kirsten · June 9, 2020, 3:09am

I recalled agreement that one of the ‘exploration’ tasks was to define the useful ‘mega-report’ and investigate building from scratch one nice new clean perfect in-app report, that provided us with (e.g.) line items with correct taxes and fees . .

just re @Matt-Yorkley’s in-app investigation above and @luisramos0 comment about in-app report refactoring being a 4th option - it looks to me like it is a strategy investigation under option 1. My notes had it structured like below. I hadn’t put them in because they were the same as Matt’s Use-Case categories, but now I think this structure might also be useful

1. “IN-APP” REPORTS

Strategy investigation:

a) new report/s that are built beautifully and give us the key data we need - @lin_d_hop to define / suggest?
b) do from database replica rather than main database
c) gems
d) ‘re-factor’

2. DATA EXTRACTION

a) API
b) other formats - odt etc
c) ?

3. Business Analytics

a) Metabase - Kirsten exploring, others exploring - Q. spend half day to get Aus data set up so that we can closely investigate how easy it is for us to build the reports our customers want (much easier with familiar data). ping @chez
b) replicas - Matt building roadmap for getting effective replica set up and hooked in (to be considered at next product curation?)

Matt-Yorkley · June 9, 2020, 10:08am

Reports Spike: Available Report-Building Gems

I did a quick spike on Ruby gems that could possibly help with the issues outlined in the Reports Code Analysis above, including things like:

handling some of the more complex aggregating/summarising functionality more cleanly
allowing for more readable and concise report definitions
allowing for easier manipulation of the report that might be useful for UX functionality, etc.

Conclusion

There were a few interesting frameworks/solutions, but they were generally not very lightweight, a bit over-engineered, bad on performance (I did some quick experimenting) or provided nice functionality but in a very limited way.

So it doesn’t look like there’s a gem that will adequately cover our reports needs.

Matt-Yorkley · June 11, 2020, 8:20am

I made a quick POC on what a reports rewrite might look like, based on both the Reports Code Analysis and some inspiration from features I saw when looking at report-generating gems (although none of the gems ultimately seemed like a viable/usable solution). The POC looks like this:

This was more of a “glass-of-scotch-and-some-leisurely-hacking” kind of thing at the weekend, but if we want to go in this direction I could do some actual work on it.

TLDR; it works really well, covers all the objectives in the Reports Code Analysis, and I think it would enable all the requirements outlined in Harriet’s analysis to be completed really easily.

Matt-Yorkley · June 11, 2020, 8:38am

Just a quick opinion on the “do everything in Metabase” option:

I think it would be unfeasible to implement all of the required permissions and other logic we would need inside of Metabase in order to cover all the reports requirements
If we went in that direction I think it would actually be incredibly expensive and we would not have anything that was production-ready for a long, long time…

I do think Metabase is awesome for certain things and certain cases, and we should explore it more, but I don’t think it’s a viable option for reports as a whole. If you want to see things like quarterly sales data visualised in a beautiful graph plotted against last year’s data (visualisations of broad queries on raw data), Metabase is incredible. If you want an Excel packing sheet for an order cycle, with per-user and per-enterprise permissions and various bits of app-specific business logic applied, Metabase is not the right tool.

lin_d_hop · June 16, 2020, 5:05pm

Based on everything that has come up so far we seem confident that we have 3 ways to access OFN data that we wish to support:

In App
Business Intelligence
API

From the great tech spike work that @Matt-Yorkley we have lots of valuable insights. In particular it seems clear that we need to follow all three routes in order to satisfy our report requirements and that no one of these will be the one that rules reports totally.

So in order to create a Roadmap for Reports we need to understand the priorities for each of the routes.

In App

Looking at the summary of requirements it seems that we have a few ways forward for In App reports:

Focus on refactoring existing reports one by one, improving consistency and fixing bugs as we go.
Start the refactor with a new report to solve the pressing enterprise fees and tax issues. This will serve as a trial for the reports refactor and we can bring other reports in as we go.
Start the refactor by enabling a ‘mega-report’ that solves most reports needs and forms the basis of the reports refactor by rendering other reports obsolete. The mega-report could replace most reports if it enabled:

Querying orders with and without line items
Includes all the key search fields from reports - order cycles, distributor, shipping/payment methods, date range
Sort by Distributor, Producer and Shopper
Totals and aggregation
Hide and display a million potential fields

It is likely that an approach of developing a new report will be lower risk as we can leave the existing reports alone while we do it and experiment with a new reports approach.

EDIT - Note that In App reports can be available via the API in the refactored form that Matt has identified with no extra work.

EDIT - To be clear the priority reports problem to solve is Tax… though the questions remain about the scope and product strategy of this report.

So questions:

If we go for a new report what is the scope of it?
What are our priorities and scope for the first iteration of updates for In App reports?
Is there a design task to bring consistency to report field conventions and naming that should be done before a new report?
Or would it be better to design a new report and then clean up the old reports afterwards?

Business Intelligence

It is clear that we have a number of large users that need bespoke reports to solve their requirements. We’ll never be able to satisfy all our users specific needs when it comes to reports and BI so using an external system built for this is a great solution. Go Metabase!

With not much work we can set up metabase for super admin access on the same DB that the app uses. This is fine for some cases can gives superadmin more of the power they need but it does not solve all the problems.

Ideally we want users to be able to access all the power themselves. But this means recreating the OFN permissions logic in Metabase and that logic is complex. There are infrastructure challenges inherent as well - see below.

So the questions here:

Where does the Metabase roll out sit in the reports priorities?
Does it make sense to create super admin only Metabase for each instance as they need it? Eg France, Aus, Katuma all have their own Metabase on their app DB.
Would it be safer to create duplicate DBs for all instances using Metabase?
Are there bigger gains if we instead focus on aggregating duplicate databases so we don’t have to duplicate work in Metabase?
Or is it a higher priority to get the OFN permissions system into Metabase to enable users to access?
And how does this sit alongside the other reports priorities in our Roadmap?

API

From the investigations is have become clear that just ‘putting reports on the API’ posed problems with performance, both in terms of our json rendering (its slow) and in the complexity of aggregation. (EDIT - In reports refactoring Matt has shown that it is technically possible to put all reports on the API efficiently). If we need to make an end point for every report then the actual data gains found might not be all that worthwhile.

The primary data extraction needs via API are Orders with and without line items. The API already has endpoint for these, imperfectly. The great work done here suggests that almost all the search parameters we could dream of will filter correctly. However do have a bit of a mess in the responses - in particular adjustments, fees and taxes are a mess. We know that we need Bye Bye Spree before we can refactor the adjustments code.

There is much to be done to understand the data structure that we want for our json outputs. Fortunately the API is already out of Spree (hooray!) so we can being this work as we wish.

Questions:

Where does the API work sit in our priorities?
Is there a design process that should be starting earlier to understand our real needs from the API?

Infrastructure

As data extraction and reports put a high load on servers there is much talk of replica databases. One proposal has been to have all instances join together their data on a single replica DB. This offers huge potential advantages when it comes to integrations, allowing all instances to benefit from Zapier integration, Metabase reports and other API integrations without reinventing or needing their own tech team.

So, questions:

So from a product perspective is this a direction we want to go?
Do we want to prioritise this before other work?
What else do we need to know from a technical perspective that will help to position this work in the priorities?

Shout out to @Matt-Yorkley for the absolutely awesome technical investigations that have helped us understand where to go with reports!

Seems to me that we are now at a place in which we want to create a Product Roadmap for the reports work so that we can understand the priorities within and between the three streams and start to plan the work.

This is all intended to be food for thought for the reports meeting on Tuesday 23rd June. See y’all there.

Matt-Yorkley · June 16, 2020, 5:33pm

Just to clarify a bit on the API point; the initial idea was moving reports into the API, which would mean any in-app reports (on screen or exported as spreadsheets) would have to consume that API before doing anything. This ended up being incredibly inefficient, as it went through a process of: fetch the data -> run it through serializers -> render it as JSON -> run it through Angular -> render the data a second time as HTML. The process added about 25% to the total page load, and it would have been similar for exporting to other formats.

The new proposal I’ve made would would have the same benefits of that initial API option (all reports would be accessible from the API, for no extra work) but the code would live in a slightly different place (not primarily in the API) and the result is that generating any in-app reports (on screen or as spreadsheet files) would be twice as fast instead of +25% slower.

Rachel · June 16, 2020, 5:36pm

Thanks for this great update @lin_d_hop

I thought it was reports on tax the first priority on in app reports? Did I missunderstood?

lin_d_hop · June 16, 2020, 6:24pm

Yes tax came out on top… but are we just doing a ‘tax report’? Or are we looking toward a ‘mega-report’? Or refactorring the existing tax report? What is in and out of scope in a ‘tax report’?
But you are right - tax came out the top priority and I didn’t make that clear.

Kirsten · June 16, 2020, 9:33pm

This is so so good. Awesome work, excited about meeting next week

maikel · June 17, 2020, 1:15am

Great work!

About infrastructure: Having one central server with all data is a security risk. Technically it would be very efficient but I find that quite scary.

lin_d_hop · June 17, 2020, 10:27am

Interesting. Why is this scary @maikel? Because a hacker could get international shopper data? Addresses etc.