Core/Non-core Instances and Sysadmin Support

Matt-Yorkley · October 1, 2020, 1:41pm

We’ve had a lot of new instances starting up since the last gathering and there have been some previous discussions on the need to clarify how much support the core team should give these instances. This includes governance and budget decisions.

From the global perspective there are a number of points that have been raised:

Core instances contribute substantially to OFN’s funds
Non-core instances that are starting up don’t contribute financially
Our funds are limited
We have a small sysadmin team with limited time
Paid sysadmin time is expensive
We often volunteer time helping instances out with problems

We need to clarify what our plan is with sysadmin support, and what level of support we want to give and what the boundaries are, not just for us but also so we can be clear with non-core instances.

Things have changed a lot in the last year, so it’s also a good time to review things and see where we are now and where we want to be.

Matt-Yorkley · October 1, 2020, 1:56pm

There are a lot of different sides to sysadmin support, but purely on the provisioning and deployment side of things we basically have two options:

Option A - We don’t provision or deploy non-core instances

sysadmins in non-core instances have to do these things themselves, adding a burden of time, effort, and required technical expertise to every non-core instance
these standard sysadmin tasks might get done in an ad-hoc manner, and by less experienced sysadmins
the likelihood of problems arising from minor user errors is much higher
the core team can never really know what state those servers are in at any given time, which is a nightmare for diagnosing potential bugs/problems

Option B - we provision and deploy non-core instances automatically, along with the others

we remove a big burden from all non-core instances
this stuff is already automated! It doesn’t really add any additional time or work for core devs; it’s still just one command to deploy/provision all servers. I’m pretty sure the difference between deploying 8 servers and deploying 10 would be negligible, especially since these changes to improve Ansible’s efficiency when working with lots of servers at once
non-core instances wouldn’t really need to do anything in terms of sysadmin work
we reduce potential issues and problems from user error, things not being updated at the same time, etc, and reduce overall issues on all those instances
the core team can be pretty sure all the servers are in the same state

I think we should go for Option B!

This would probably need some re-arranging of our secrets management (which is currently awesome for core instances and non-existent/terrible for non-core instances). I think we can improve/extend what we have and automate the problems away quite easily so that for example:

non-core instances can have separate (private) secrets repos that core devs can access and the sysadmins from those individual instances can access (via GH permissions)
we adapt the current automated secrets-fetching (which is currently only usable by core devs) so that core devs can fetch all secrets, and sysadmins from individual instances can automatically fetch their own secrets but not others

Basically the issue is that permissions in our secrets infrastructure are currently all-or-nothing: you can either access all secrets for all instances, or none at all. This can be easily resolved.
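
As a rough illustration of the per-instance permission model described above, here is a minimal Python sketch. The function name, the role labels and the instance names are all hypothetical, and it does not reflect how the current secrets-fetching is actually implemented.

```python
# Hypothetical sketch of per-instance secrets access; not the real tooling.
# Instance names and role labels below are illustrative assumptions.

ALL_INSTANCES = {"core_a", "core_b", "noncore_x", "noncore_y"}


def fetchable_secrets(role, own_instance=None):
    """Return the set of instances whose secrets a user may fetch."""
    if role == "core_dev":
        # Core devs can fetch secrets for every instance.
        return set(ALL_INSTANCES)
    if role == "instance_sysadmin" and own_instance in ALL_INSTANCES:
        # Instance sysadmins can fetch their own instance's secrets only.
        return {own_instance}
    # Anyone else gets nothing.
    return set()


print(sorted(fetchable_secrets("core_dev")))
print(sorted(fetchable_secrets("instance_sysadmin", "noncore_x")))
```

In practice the same rule could probably be enforced by the GH permissions on the separate private repos, with a check like this only deciding which repos the fetch script tries to pull.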

lin_d_hop · October 1, 2020, 2:37pm

In theory this sounds excellent and I would love to move in this direction. I do have some questions:

If we do this is there any difference between a core and non-core instance?
What happens if a non-core server goes down? Say due to an attack, out of memory, someone isn’t paying for the server? Does the core team investigate this?
Do we ask for anything from non-core instances for this service? Signing the pledge?
Shall all instances have some way of reporting to the global community once a quarter about the work they are doing? I would hate to be in a situation one day in which we deploy 16 instances but 9 are just unused. So how could we avoid such things?

These are the first thoughts that come to mind.

Matt-Yorkley · October 1, 2020, 2:46pm

I don’t have any answers to those questions, they all need some discussion and/or decisions!

luisramos0 · October 1, 2020, 3:30pm

Nice one Matt!

The different parts of the sysadmin job for a given server:
A - initial configuration, provisioning and deployment
B - regular provisioning (this needs to come with the task of resolving problems during provisioning)
C - deployment of new versions (this needs to come with the task of resolving problems during and after deployment)
D - resolve operational problems like instance size, disk space, load/performance, crash, etc
E - major upgrades: provider, OS or DB upgrade for example

If we go for just doing B and C for non-core instances, answers to Lynne’s questions would be:
1 - core instances get A, B, C, D and E from the global team, non-core get B and C only
2 - no, that would fall under point D
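
To make the split concrete, the A-E list and the "B and C only" option can be written as a small lookup. This is only a sketch of the idea in this post, and the identifiers are made up for illustration.

```python
# Sketch of the proposed split; identifiers are illustrative assumptions.
SUPPORT_TASKS = {
    "A": "initial configuration, provisioning and deployment",
    "B": "regular provisioning",
    "C": "deployment of new versions",
    "D": "resolving operational problems",
    "E": "major upgrades",
}

COVERED = {
    "core": {"A", "B", "C", "D", "E"},
    "non-core": {"B", "C"},
}


def global_team_covers(instance_type, task):
    """True if the global team would handle this task for this instance type."""
    return task in COVERED[instance_type]


# A non-core server going down is an operational problem (D).
print(global_team_covers("non-core", "D"))  # False
print(global_team_covers("core", "D"))      # True
```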

Kirsten · October 1, 2020, 7:03pm

The key question here is what defines a core instance. If these newer instances want to be included then why would they not be? The simple answer is that there is a cost of including them, and the global pot isn’t sure whether we should pay this when new instances have no money, and if we do for some, should we do for all, and if not, how do we decide?

This is an active discussion in global gov sessions and I agree that’s the right place to continue them (discussions and proposals)

Here’s some more info to shape development of a possible proposal

What’s a ‘Core’ Instance / Who is in the Global Sys Admin pool now

Are ‘core instances’ just those who contribute $$ to the global pot? Well no. Why is USA in our current group of ‘core’ instances? There is history on this, but ultimately because we decided that
A) the risk and opportunity cost of not having a ‘good-looking’ and useable instance in the USA was too high.
B) if we didn’t fill the gap someone else who wasn’t values-aligned was going to (was actively attempting to)
C) at the time this decision was made @lauriewayne1 was a very active member of the community and was clearly contributing to the commons in many ways even if not financially
D) the likelihood of significant funds being able to be accessed in the USA at some point is very high, and having a strong, stable and used instance there increases the likelihood of access to those funds
E) there was an element of marginal cost increase to just add one to the pool i.e. most of the setup work was already done - they had demonstrated high degree of commitment, gumption and capability

This last condition is now likely the same for Belgium and Germany, they funded their own setup and maintenance for a period, so it is legacy reasons why they are in this pool, rather than any particular strength or ‘core-ness’ to the community of these instances.

Let’s turn this into a proposal . .

Proposal - Request to become part of Global Sys-Admin Instance Support

We establish a process whereby an affiliate can request to become part of the global sys admin pool (similar to a request to become Certified Contributor). There are basically two pathways to being accepted for this.

Pay your own way - active financial contributions to the global pot from the outset. We need to agree and publish a ‘rates’ table for this, likely taking the economic and currency situation of a proposing instance into account
Can’t yet pay. To support their request, they should outline both their potential and demonstrated contribution to the Commons. This can include:

That the instance is available to farmers and food hubs in that country - they have managed to get an instance set-up on their own steam. This demonstrates commitment, a realistic understanding of the challenges and complexities AND a potential expansion of the global development capability
That the instance is accessible to farmers and food hubs in that country - there is a support person / process / team and potential users can contact someone and will be supported to setup and learn to use the platform
That they are actively building a presence and relationships with the agroecology and food movements of their country - they are connected and integrated and aligned with the needs of their local context
That they are aligned with and committed to the development of the Commons and are actively part of and contributing to the Community. The simplest way to demonstrate this is that we know who you are and you have developed relationships in the Community. We have likely come to know you through Global Hangouts, listening circles, the #instance-managers channel, well-specified and high quality bug reports, papercut suggestions and wishlist items. In order to have demonstrated this last point, you will need to have engaged with and understood the prioritisation and development process. NB. Occasionally dropping into slack and telling us what ‘you’ (i.e. we) should have prioritised or done or should do next does not meet this criterion
Genuine and competent efforts in fundraising i.e. links to actual completed and submitted fundraising applications. You have likely worked with others in the global team on these, accessing useful previous funding applications and incorporating global contributions into your fundraising in a way that aligns with established processes i.e. you haven’t just said “we’ll spend $20,000 fixing XXX” without any understanding of if and how this will align with the coordinated development process
?? not sure about this one - nice to have - indication that the technical capacity you drew upon to get this far is interested in / being redirected to supporting the development pipe i.e. picking up code reviews, papercuts, ‘welcome new devs’ issues etc

NB. I believe there are a couple of newer instances who could easily demonstrate their meeting of these criteria if they wanted to - note this is a voluntary process, not an expectation or requirement. If they are happy continuing to manage their own deployments that’s fine

TBC - Proposals for:

process to remove or ‘hibernate’ an instance from global sys admin

Comments here very welcome

and if people think this is on the right track I will move it into the Global Governance - Discussions and Proposals google doc for comments / refinement prior to the session.

Matt-Yorkley · October 2, 2020, 3:53pm

Currently in practice we end up giving ad-hoc support on a voluntary basis, and I think this will generally increase.

Leaving aside things like server migrations and investigating downtimes (that have a real cost), if we did only the provisioning and deployment (I think India asked about this?) it would have basically zero cost for us, and I think overall it would reduce problems and the need for support (i.e. reduce the demand on our time, not increase it).

It would obviously be voluntary and by agreement, and as you say might only be for instances that show some level of dedication.

tschumilas · October 2, 2020, 6:28pm

I’m not disagreeing - but just adding things to think about…
It isn’t only about deployment. I observe that new instances also engage current instance managers and others in a whole range of help requests. It seems like there are a lot of questions/issues with configuration settings (even though these are highlighted in the super admin guide), transifex & user guide translations, general support on learning how to use the platform for the new instance manager, user support questions generally, questions about the community development (vs tech development) of running an instance… I don’t have an answer or position on these questions about core/non core instance support - I just want to point out that it is not only about sysadmin support, it’s about all aspects of instance management/support.

lin_d_hop · October 4, 2020, 10:37am

Extending Luis’ list we kinda also have:

F: For other instances eg Belgium we’ve also included paid options like choosing S3 bugs, choosing papercuts and participating in voting.

Continuing from Kirsten above and including Luis’ options… what if…

… A potential new instance approaches us.

They have two options:

Pay us
We could have different pay scales for A-F, based on the work already done with BE, perhaps to iterate as most seem to turn us down when we share those figures.
Request support
Before we support a new instance we ask them to complete a New Instance checklist that covers a lot of @Kirsten’s points and a lot of the things I’ve heard @Jen talk about.

Once a ‘Support Request Instance’ completes the checklist to our satisfaction we offer A, B and C from Luis’ list. Note they need to pay for servers… we need access but cannot be responsible for paying for servers!

We then continue to support them, asking them to perhaps fill in a survey about the struggles and successes they’ve been having on a regular time period eg every 6 months. Partly for our data and partly as their commitment. We can also monitor usage.

Option D-E might be something we offer once at a specific scale (as we then have a reputation cost of the servers going down)

Option F might be something we offer based on contribution - either in-kind or through funds to global pot.

Obviously this is quite a bit of work, but I wonder if we could achieve it by:

Paying for a piece of work to compile v1 of the New Instance Support Request Checklist.
Once in place perhaps the Instance Manager circle could be responsible for checking off new support requests? Perhaps if 2 instance managers approve then we include the new instance. This might also enable us to get past the potential language barriers and allow us to accept answers in more languages than English

Rachel · October 5, 2020, 1:13pm

Extending your list there is also everything that circles around the software: sendgrid account, stripe account, Zapier account, adding a new translation file… all of these need a sysadmin

I must say that I understand we would love to deploy to all instances at once, but as FR grows I’m increasingly lacking a sysadmin to help us locally. So I’m starting to think that the global sysadmin pool was a great idea to standardize processes, but I wonder if it can scale.
Currently there are a lot of things I’m asking François to do because there is no place in our pipe for instance-specific sysadmin requests.
I think we need to re-define our process around that. And if we don’t manage to include specific sysadmin requests, I’m afraid it’s a signal that sysadmin is not a service OFN global is ready to offer yet…

luisramos0 · October 5, 2020, 1:44pm

Interesting! Rachel, can you give some examples of sysadmin tasks you have asked François to do?

I think the global sysadmin team can very easily scale and take more tasks. It just costs us more time/money. We do have the people and the skills to do it (all current core devs have a very good understanding of the servers and ofn-install).
For FR sysadmin I think it’s better to keep it at global level because then we (global team) know what’s going on there. It’s a risk in itself to have separate teams working on the same server. I have no clue what type of changes François does on the FR server (that’s also why I am asking about it above).

This is just adding to the conversation, I am not saying we better keep all of it at global level.

Rachel · October 5, 2020, 3:14pm

Just to be more precise: I was not undermining skills within the team, I just wanted to underline that indeed it’s a problem of cost/money and prioritization. If we choose to do it as an ongoing task for all instances, it will have an impact on our dev pipe. We don’t have a separate pipe for sysadmin.
An example of a request I’m currently using the pipe for: https://github.com/openfoodfoundation/openfoodnetwork/issues/4177 - do we think it’s manageable to have all instances get their sysadmin issues prioritized this way?

I believe we don’t see it as much from AU and UK because @lin_d_hop & @maikel are absorbing a lot of these tasks in parallel with the pipe. Same with @sauloperez in Katuma. But maybe I’m wrong about this and it’s just in FR where we are not good at this

Here are the sysadmin tasks that I’m currently not comfortable asking the whole team for (but maybe I should):

Users are asking us about:
where our data are located
where our backups are stored
how frequently we are doing backups
why we chose to set up our architecture this way and not another…
I don’t know how to answer those questions…
Defining how all the tools we are using can work together. Currently we are plugging in the instance data with our CRM, quickbooks etc. This was started by Malek, but we really need someone with tech skills to lead that part.
An example of a question I currently have: should we have our CRM send emails with the same sendgrid API as the one we are using on the OFN platform?
Monitoring the production server: is this a routine devs do when doing a release? I remember we were scheduled to increase our server space (or RAM only, I don’t remember) at the same time as the UK did, and there was a problem because we needed to completely migrate our server to do it. A patch solution was found, and migrating was postponed to whenever a problem would occur again. Again I can very much be wrong about it. I don’t have the skills to be the memory keeper for FR on this but is it documented somewhere? Otherwise this is something I’m usually asking Paco about…
Specific questions like our translation setup: should we add more translations like our users ask us to, or should we remove them and just leave FR and EN because that’s the easiest for now?
Discourse, wordpress and soon Nextcloud and maybe Loomio. Those are the other tools FR needs to maintain. We need someone with tech skills to do the upgrades and monitor server performance. I think now they are on different servers but I’m not 100% sure. So maybe we are doing good on this topic but I’m mentioning it as these are the type of tools other instances might want as well. How do they install them if they don’t have the skills in their local team? They hire the global team to do it? How do we prioritize those paid tasks that are not linked to the OFN codebase?
Finally, a one-off need François has answered recently was to help us change our domain. This of course is not a task we need to do very frequently, but it was very time consuming. If I had asked the global community for this, I believe it would have created tensions within the pipe. Do we feel this is something the global team could have done if specifically paid for?

Permakai · October 7, 2020, 10:21pm

I believe NZ is best classed as a non-core instance who values being part of the community and appreciates the occasional assistance with issues we cannot resolve ourselves.

We started on a shoestring and continue on a shoestring. We believe strongly in the need for connecting local producers with customers and that is why we chose OFN. The open source nature meant we could manage everything ourselves, and manage our own costs.

We monitor our own server, deal with issues of a non-software nature, occasionally ask questions in the forum, try and resolve issues before bothering anybody and appreciate the answers that are forthcoming.

We try to be as little trouble as possible and look forward to the day we can contribute to the global pot.

OFN is a great solution for an entry level local food organisation. It’s pretty painless to install, the support is great, and the international community is one of the best I have come across. Unfortunately this means there will be times where this openness will be exploited. It’s very hard to manage this and I hope a solution is found that maintains the supportive nature of the organisation while handling the disruption and overwork that too many calls on this support can bring.

hernansedano · October 12, 2020, 10:55pm

I think it is important to really understand the problem before proposing a solution. The initial question was: how much support should the core team give to new instances (non-core instances)?
In @Matt-Yorkley’s initial post I can see the problem is related to the available time of the sysadmin team:

“We have a small sysadmin team with limited time”
“We often volunteer time helping instances out with problems”

I guess the problem could arise because helping new instances is not part of the “official” pipeline, and at the same time the sysadmin team wants to help, so the result is work overload for the sysadmin team.
About this I find interesting what @Matt-Yorkley wrote: “if we did only the provisioning and deployment (I think India asked about this?) it would have basically zero cost for us, and I think overall it would reduce problems and the need for support (i.e. reduce the demand on our time, not increase it)”. As an instance manager I find the provisioning and deployment option great.

On the other side I think that running an instance needs more components in place locally besides the platform deployment. It’s a start-up, so it needs management capacity to make it sustainable, and sometimes the enthusiasts are not conscious of that, so the risk is deploying some instances without a real capacity or commitment to take them forward.

So the challenge is how to eliminate work overload of the sysadmin team and at the same time help to develop more sustainable instances that can contribute to the OFN commons?

tschumilas · October 19, 2020, 2:03pm

Agreeing with @Rachel’s questions above. I imagine we consider OFN-CAN a ‘core’ instance, although we are not in the ‘big three’ (as it’s been referred to - and I’m not sure if that is big in users, big in sales, or what?). But there are many of the types of things that need a sysadmin that Rachel mentions. I also hesitate to ask. A simple example - I’ve had users who started entering products ‘manually’ and wanted to switch to using a csv upload once that feature was out of beta. I told them it was impossible - we can upload but not download. So they created new enterprises and started fresh with the upload. But now that we have Sean around as a volunteer, and we have given him access to the database, he creates downloads of the product list on request. And now we give a download of the product list to any user who is leaving OFN for another platform too, to save their time. So - it’s a huge value added for us to have someone with this access. If an instance does not have someone like Sean volunteering - would the sys admin do this for them? Should I have just asked in the dev channel for this before?

lin_d_hop · November 6, 2020, 3:14pm

At this stage we don’t have the capacity to offer a service with bespoke CSV downloads and secondary tools that need a bit of tech skill from the global pool.

We do have the capacity to deploy new releases to servers, provision servers etc for a fee.

I feel like we can boundary the sysadmin work for the tech team. If people need additional support they can:

fundraise for their own tech person like FR and Francois
find a volunteer like CA
go without and make do

We can’t make the perfect solution to this problem right now but I don’t think that should stop us from offering some kind of solution with hard boundaries that will help unblock the major barriers for new instances.

Rachel · November 6, 2020, 3:44pm

fundraise for their own tech person like FR and Francois

FYI it is super tough to find additional support. François is volunteering.

I might be wrong, but I don’t think boundaries can work. It’s almost impossible to find someone external to the core dev team for the external bits. The volume of hours is too low and yet if you do sysadmin you need to be able to provide a constant watch.

We do have the capacity to deploy new releases to servers, provision servers etc for a fee.

if we did only the provisioning and deployment

On paper that sounds great. In reality it is not. Emails, translations, integrations…

I don’t think you can provide a good product if you just have the OFN tool running on your server and nothing more. So maybe we do have capacity, but I don’t think this is helping instances in the end.

Capacity is something we can create by finding money, right? Why not provide more support and make it clear that’s something that instances should maybe pay for specifically?

I’m not saying all instances should pay all costs upfront right now, we can surely find a solidarity system, but I feel it would be best if our goal here was to be able to increase our sysadmin team one day, rather than keeping everything on the backend team hoping any addition will not impact our global budget and pipe.

tschumilas · November 6, 2020, 4:19pm

I’m wondering about the proposal that instances could fundraise for their own tech person. We’d be very interested in doing this in OFN Canada. In the past, I think the community preferred we do not do this. There was a feeling that IF OFN-CAN could access money for devs, it should go to the global pipe as a priority. This is what we have been doing. Having someone as a volunteer is fantastic - but it has limitations - often volunteers have ‘day jobs’ and can only contribute a few hrs weekly to OFN. At present we are putting in a round of proposals - with the opportunity of recruiting a dev to work as an OFN-CAN employee. Duties would be very similar to the list of sys admin tasks that @Rachel lists above. These are tasks that we have not felt we can ask of the global sys admins (even though we have and do and will contribute financially to the global pipe).

What do others in the community, and in the current sysadmin team, think about the idea of instances fundraising for and hiring their own sys admin?

Rachel · November 6, 2020, 5:10pm

That’s what I was trying to express above: I feel it is super hard to do because you need to find someone that can dedicate enough time to understand OFN software, but not too much because we all want to dedicate more budget to features rather than architecture/integrations. And doing sysadmin sporadically does not lead to good quality sysadmin…
So I sense a high inefficiency in hiring someone locally that would do only that. Plus we would fall back to the problems we tried to avoid when creating the sysadmin pool: only having one person that knows everything etc.
That’s why I would prefer to fundraise but at a global level, for a sysadmin team, rather than trying to find one local sysadmin.
Does it make sense?

tschumilas · November 6, 2020, 6:37pm

It totally makes sense. Isn’t mutualizing all sys admin inherent in our commitment to a global instance? If so - we don’t want instances hiring their own folks.

Yet, I know how much having Sean here has made a difference in OFN-Can - to our staff, our other volunteers and our users. Our service to them has improved tremendously.

If he couldn’t volunteer any longer - we’d be really stuck. Just like non-core instances are really stuck today.

There has to be a solution to this - we need to get out of the box with our thinking. Are there other ways we might structure our collective resources?