Data insights in OFN

sauloperez · July 17, 2019, 4:09pm

Hi there!

This are some ideas that stem from the discussions that took place in the gathering and the session we had yesterday morning with Rachel and Dani in Barcelona.

I’ll try to describe what is it that we want to achieve piece by piece in a way that everyone understands.

Need

Local and global business intelligence

This is the need we have. At local level, we want to answer all sorts of questions around the usage of the platform: how many organizations where created in the last month, total turnover of these organizations, how many user hubs have, etc.

Also, we need to know how much we have to charge our customers based on their membership and turnover each month effortlessly.

At a global level, we need to answer these questions with a single global view which could help in future funding applications.

Alerts for OFN instances

We have also the need to know when a new profile gets created in OFN as well a shop set up as this would mean we have a new customer to charge in the next billing cycle.

We need to get notified instead of us having to check periodically as this requires extra effort and it’s error-prone.

Solution

Metabase

This open source business intelligence would the insights we listed above by means of its core concepts questions and answers. The former is any data we get from a database table using filters, agreggations and grouping (aka. SQL queries with an awesome UI). Then, answers can be visualized in many ways: numbers, line charts, funnels or even maps.

The other two powerful concepts built on top of questions are:

Segments

a special named filter or set of filters. You can create these to make sure that there is an official definition of a subset of customers that everyone on your team can refer to consistently. For example, you might create a segment called “Active Users” that contains all the filters that are needed to correctly select active users.

Metrics

an easy way to refer to a computed number that you reference often. This could be something like revenue, which you could arrive at by viewing the Sum of your Order’s Subtotal column, and applying any required filters — but it’s a pain to have to do this every time you want to do something with revenue.

Then, we can arrange questions based on agreed segments and metrics in meaningful dashboards and collections. We would share segments like active users, producers and metrics like turnover.

Lastly, Metabase also has Pulses which are either email or slack notifications based on saved question.

This way we can set up Metabase to know how much we should charge our customers each month like I start doing below. Generating invoices is something that only Zappier covers and Metabase won’t.

I want to be crystal clear on the fact that to use it we need to be familiar with databases. There’s a lot that Metabase abstracts but still, it is all around database tables and the information each one contains and therefore, a basic understanding of the data model is also necessary.

Database read-only replica

Aka. having a separate database server that is in sync with the main one but which we can only use to read data from. This time, shared amongst all instances, so we move towards a single infrastructure.

Why? because it will enable Local business intelligence and Global business intelligence by means of Metabase but also:

Generate reports without causing trouble to other users as they’ll fetch data from this separate read-only replica. This should also bring some speed improvements.

From an architecture perspective this looks like

Pilots

To move this forward we suggest the following pilots

1. Katuma starts using a global Metabase instance

With this we want to prove that Metabase covers:

our immediate data per shop needs, now that we have pricing plans and new hubs as members of the coop who need to be charged.
our need to understand whether our efforts on user acquisition pay off or not and the overall growth of the instance.
the necessary alerting needs so that we get notified whenever new shops appear on the platform.
the need to know how much to charge our customers each month.

Why Katuma? Australia and UK have these needs somewhat covered with the current Zappier solution while France still has some room as they will start charging customers in September at the earliest. Other instances still don’t have defined pricing plans. Am I right?

This requires the aforementioned database read-only replica and needs a tech solution.

2. Onboard France to Metabase

With this next pilot we want to prove that Metabase also covers:

The need to have consistent data across instances which allows comparing data between instances
single view of global data, which has a big impact on potential funding applications.

Beyond here

We believe that Metabase does cover our data and alerting needs but it does not help us automating the per-shop invoice generation that all instances need which we think involves integrating Zappier with our accounting solutions.

Then, as a separate topic we could make reports fetch the data from this database replica as mentioned above.

Finally, there some privacy concerns that need to be answered: where this Metabase server will be located, whether we need to mask any personal data such as phone numbers, names and stating this data collection in our privacy policy.

Tech

The necessary discussion around technical solutions needs to happen after we agree on the whys and the pilots, not before.

@danielle @Rachel is there something else you want to add?

Kirsten · July 23, 2019, 5:46am

Just as an aside, I haven’t yet set up Zapier for invoicing in Aus. Basically because there are only 20-30 of them monthly and at this early stage I quite like the manual handling to make sure I’m across it! That will soon fade and automation will be ace. Is there any reason that zapier couldn’t pull info from metabase? Seems like the database query approach would work just as well to metabase as it does to ofn itself

danielle · July 23, 2019, 9:18am

I don’t see why not but I’ll leave it to @sauloperez to confirm.

sauloperez · July 23, 2019, 11:04am

There’s no way Zappier can pull anything from Metabase because it’s just a web app that shows a bunch of graphs based on queries that we build. It’s not a database. I even check and there’s no Metabase-related app in Zappier.

As you can see in the screenshot, both Metabase and Zappier will pull data from that separate database replica.

sauloperez · August 1, 2019, 2:47pm

I like to hear opinions and discuss in the next delivery train meeting @luisramos0 @maikel @kristinalim @lin_d_hop.

I will also ping you @Matt-Yorkley to understand what have you been playing with so far.

luisramos0 · August 1, 2019, 5:22pm

Thanks Pau for taking this forward!

Matt on slack a month ago: “I have a working implementation for this idea, It’s using datastreaming via Kafka”

this is really exciting!!

I dont think this is going to be a small piece of work. Can we try to estimate? Setting up the kafka processes and having metabase running will take at least 2 weeks dev time.

Can we clarify priority of this work?
As far as I understand this task was not placed at the top of the backlog in the gathering: Global Gathering - Day 8 - Pipe Prioritisation - DEV PIPE
I think this was at the same level as a few other around the center of the quadrant… was it?

Matt-Yorkley · August 1, 2019, 7:52pm

I can clarify a bit there… the work is done, and this is already running, with database changes from UK Staging being streamed to a server running Kafka, with the changes being applied to a corresponding replica. Metabase is already set up on it as well, and configured to read from the replica.

It needs some discussion about implementaton that we were supposed to have the other week, but we didn’t get around to it. I would have looked at it again last week, but I was a bit distracted!

Basically we can choose which tables we want to replicate rather than doing the whole database, so for instance we could leave out the users table, addresses table, etc. Partly so we don’t move personal data around, and partly to minimise the amount of data the central server has to process.

So as a POC, it’s live and working. The code is sat in the (private) ofn-data repo. It’s 100% Ansible managed, so we can deploy it anywhere, with reproducable results from a couple of playbooks commands.

Yeah, lets discuss it!

Edit: there’s other technical considerations as well, like if we start doing datastreaming from all instances to one server, and then we start running Metabase, Zapier, and queries from the app for reports on those replicas, we don’t know at what point the $20 server will melt.

Also me and Pau seemed to have very different ideas about how many of our DB tables we needed to replicate.

One last thing, I feel like maybe we should get some of the things from this week’s new repo in place (and stable) before we start doing this. We should have really good security and monitoring before we create a data warehouse…

Matt-Yorkley · August 1, 2019, 8:05pm

Yes, you’re right. I worked on it in my spare time because it was really interesting.

This probably deserves the Pipehack Alarm Gif™ as well. That’s the second time today…

Matt-Yorkley · August 1, 2019, 8:20pm

And another thought… the current setup supposes a single-node Kafka cluster on one server, but if we start depending on it for production functionality (we shouldn’t do that to begin with) then we would probably need to look at a multi-node Kafka cluster. It’s not complicated, but we’d want 2 minimum.

And at some point I think we would really need something like this monitoring setup: UK's downtime retro

luisramos0 · August 1, 2019, 9:32pm

wow Matt, amazing stuff! I didnt know you had already done the kafka work to get the db replicated. I thought you were giving it a try! That’s incredible

ok, it’s good we have it clear this is a pipe hack. I say this because this is an awesome pipe hack, like tech and product, everyone wants this! We just agree we need to get Performance and Mobile out of the way first.

In terms of the topic itself, you guys rock, all aligned for a nice solution. I think will do fine with one single server for all instances and scale it up if required later.

sauloperez · August 2, 2019, 2:33pm

awesome @Matt-Yorkley! Totally agree that doing that now is a pipe hack. To begin with, I’d like to know whether we agree on the pilot we’re proposing.

Then, I’d like to see how can we make this fit with our local agenda’s. I have Anna pinging me to know how to get the data to start charging our customers.

Screenshot%20from%202019-08-02%2016-30-44

Is there a way perhaps to start using what you implement @Matt-Yorkley internally in Katuma outside of my OFN committed hours so we don’t commit a pipehack crime?

Currently I only have a docker container running in my laptop and importing a production database dump whenever I need to check numbers. I don’t want to waste any more time patching temporal solutions.

maikel · August 6, 2019, 2:24am

@sauloperez This is what we use in Australia so far:

gist.github.com

https://gist.github.com/mkllnk/9be5e677bffb65141998d9ff9925ecc1

monthly-sales.sh

#!/bin/sh
#
# I can execute this:
#
#   ssh ofn-prod './monthly-sales.sh `date --date "-1 month" +%Y-%m-01` `date +%Y-%m-01`' > "OFN-monthly-sales-`date +%Y-%m`.csv"
#

if [ $# -lt 2 ]; then
  echo >&2 "Usage: $0 2018-09-01 2018-10-01"
  exit 1

This file has been truncated. show original

We are thinking of using similar SQL queries in Zapier to automatically send invoices to our users. But we are still ironing out our communication and discussing around billing in general.

sauloperez · August 7, 2019, 1:18pm

and how do you provision that and where @maikel? manually to the AU prod server?

Unfortunately, that doesn’t solve our immediate needs at Katuma since it requires server access and CLI skill Anna does not have (and I don’t intend to teach them).

I guess I’ll see how to get a docker container up and running a cheap server and get going.

maikel · August 8, 2019, 12:41am

The script was manually copied to the server. It’s run my cron once a month and the CSV file is made available under a certain URL for the team. No need for CLI access there.

I’m not saying that it’s a good solution and that you should do it as well. I was just sharing what we are doing. So if somebody wants to set up Zapier or something better, we don’t have to write those SQL queries from scratch and do the same mistakes around time zones and cancelled orders again.

sauloperez · October 30, 2019, 4:16pm

Hi everyone!

Time to share some updates and move this forward. Last week @matt and I had
a catch-up to get on the same page and see how to proceed now.

We touched on the architecture of the system so far and understood its challenges. I won’t go into details here but he explained it very well at https://github.com/openfoodfoundation/ofn-data/wiki.

These are the things that stood out:

We need to add the security stuff @matt worked on before going to production
There’s some concern about the resource impact this will have in all instances. We need to try and monitor.
The current setup is ready for prime time. No big extra tasks need to be done. We might only need to add monitoring to this new server.
We might need a separate purpose-specific server (extra RAM and better disk?) to host the DB replicas pretty soon.

The single aggregated database

We came to the agreement that besides the per-instance read-only replicas what we need is a single read-only database that replicates data from all instances together so that we can have the global perspective of OFN as a single entity and start gathering relevant metrics at many levels.

@Matt-Yorkley thinks Kafka connect (the underlying tech) could fill this need as well but this something wasn’t yet implemented and will need investigation.

Updated roadmap

What we did so far

We haven’t moved much the proposed pilot above. We did the following.

Connect Metabase to Katuma’s production DB (exactly what we wanted to avoid with the DB replicas…) to start gatering billing information.
Prepare some basic usage dashboards we used to share metrics with the Dimmons research group.
Connect Metabase to UK staging with a somewhat old production DB dump to start playing with metrics.

What we propose from now on

The sudden crisis at Katuma due to the lack of instance manager (working on it) makes it a less good candidate to pilot.

Also, due to concerns about the extra load, all this might add to our servers we think is best to choose a standard 8Gb RAM instance server (Katuma is still to be upgraded).

So, to summarize, these are the next actions (in order):

Update all dependencies before hitting production https://github.com/openfoodfoundation/ofn-data/issues/7
Change Datadog’s in-depth performance monitoring (Called APM) from the UK to FR. We pay for a single server.
Provision FR production with ofn-data.
Keep a close look at the monitoring to spot performance implications.
Onboard and help @Rachel get acquainted with Metabase and report insights back to the team.
Check whether Kafka Connect can replicate from multiple instances to a single database.

Of course, we need @Rachel’s and @lin_d_hop 's agreement first, at least.

Anything else you might want to add @Matt-Yorkley ?

If there’s some consensus, I’ll prepare issues to add at least one to the next delivery train.

Rachel · November 6, 2019, 4:14pm

I’m ready for the next steps