Global Sys Admin Process

Kirsten · February 28, 2018, 10:21pm

As part of global strategy we have an objective to be a Single, Cohesive OFN Unit. One of the priority goals in this objective is “any system administrator can support any instance”

This requires:

1. Instances being set up in a standard way so it is easy for any sysadmin to hop in and work with them - see sys admin standardisation
2. Access - sys admins can access instances when needed [ad hoc, but gradually happening]
3. Process - people knowing what to work on and when, and what is ‘funded’ e.g.
   - what has to be done
   - how urgent - how is it prioritised alongside / within global delivery pipe
   - if/how this work will be paid for

This post outlines a possible way forward for 3 (NB. ‘straw man’ for discussion!). Summary first with reasoning / detail below. @MyriamBoure @danielle @enricostn @maikel @lin_d_hop @tschumilas - thoughts?

SUMMARY

Github global sys admin pipe - @danielle to advise how she wants me to set this up e.g. project?
Global sys admins can count in their ‘global hours’ tasks that are:
- minor support for new non-funded instance with own devs (Slack #deployment)
- pick from top of the Github sys admin pipe
  - a) new instance or b) maintain/upgrade, where ‘funded’
  - c) call from any ‘established and contributing instance’ to emergency support. Issue in GH pipe and then call to Slack deployment for immediate help
The sys admin github pipe may also contain requests from non-funded/non-contributing instances e.g. ‘upgrade USA’. These tasks can be done at individual discretion (e.g. volunteer) or if discussed and specifically agreed by global community for strategic reasons.
Established instances like Aus, UK etc maintain own sys admin resources for now

Sys admin tasks can be tagged to instance in Toggl where relevant so we can keep an eye on costs vs contributions

REASONING / DETAIL

What has to be done

We have developed curation processes for product development and bugs, and now we need a curation and triage process for sys admin.

Types of tasks / issues that arise:
a) Establish an instance
b) Maintenance and upgrades
c) Help! Emergency: someone’s server is down, emails aren’t working - includes troubleshooting after release if issues seen on one instance first but will likely affect everyone

As they are mostly clearly specified (at least the problem is), time critical and non-negotiable, I think the discourse icebox process is too onerous.

Required tasks in a) and b) like ‘set up instance’, re-provision XX, upgrade yy are easily captured in a github issue. @danielle as the github master - can you recommend whether this should be a project, milestone or how would you like me to set it up?

When a c) occurs the reality is that Slack is the most likely place to deal with it. Would be best if the instance with the problem raises a github issue with as much info as they have, whacks a P1 / S1 (or whatever we come up with) label on it but then hits Slack #deployment to call for urgent attention / response / help

How urgent - how is it prioritised alongside / within global delivery pipe

We could likely develop a severity/priority labelling system like for bugs. Let’s get this set-up and I can put the things I know of in there and then propose a prioritisation process

If/how this work will be paid for

Reliable and experienced sys admins are frequently called upon to support deployment and other instances. Some of these requests have budget and some don’t.

Maybe best to think this through with some examples.

Case 1: e.g. Indonesia

new instance, own devs, no money
support required: responses to snags / questions on slack
core funded? allow for core and occasional devs to support

Case 2: e.g. USA

new/established instance, some dev but limited and need help, no money
support required: hands on assistance with set-up, trouble-shooting, upgrades
core funded? Global decision as to whether strategic to extend support in absence of funding. Likely very limited

Case 3: e.g. Canada - ‘established and contributing instance’

established instance, no dev, has money
support required: all sys admin
core funded: yes. $$E into global pool and tasks taken from github sys admin pipe by whichever dev is best placed to do them. May maintain a relationship with key sys admin contact e.g. @maikel to know who to contact directly for c) cases, but likely this will fade over time once global process working
@tschumilas once we agree and adopt this process, and @danielle sets up github (or tells me how she wants it) I will create issues for the things you want done

Case 4: e.g. Aus, UK, Katuma - ‘established and contributing instances’

established instance, has own sys admin, has money/resources, likely has custom ‘surrounding’ sys admin needs e.g. additional wordpress connections, forum etc
support required: none / all
funded: yes
Option 1: retain own sys admin resources e.g. Aus keeps some money and time back in Aus, out of global pipe to ensure our sys admin tasks are dealt with when and by whom we want
Option 2: put our ‘allocated’ sys admin resources (2 hours per week of Maikel) into the pool. Then put our needs/tasks into the sys admin pipe. Again probably assigned/done by Maikel in first instance but perhaps this could change over time
- if our Aus allocation is not needed for Aus it would just flow into surplus for commons sys admin needs. I think this is most likely: our sys admin need here has totalled 12 hours so far this year, and that’s only because we just completely redid our production server (upgraded and moved onto cheaper hosting), and it was a total of 20 hours for all of last year
- if it is not enough?
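
A rough worked version of the arithmetic behind Option 2, as a sketch only. It assumes the 2 hours per week allocation runs for a full 52-week year (the post does not say so), and the function name `annual_surplus` is made up for illustration.

```python
# Sketch of the Option 2 arithmetic; WEEKS_PER_YEAR is an assumption.
WEEKS_PER_YEAR = 52


def annual_surplus(allocated_hours_per_week, hours_used_in_year):
    """Hours left over for commons sys admin needs after an instance's own use."""
    return allocated_hours_per_week * WEEKS_PER_YEAR - hours_used_in_year


# Figures from the post: 2 hours per week allocated, 20 hours used last year.
print(annual_surplus(2, 20))  # 84
```

A negative result would correspond to the "not enough" case, which is the open question.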

I feel like Option 2 makes more sense in the long term, but with the different surrounding sys admin needs it likely still makes sense to keep it separated. At least until everything is clear and working well for Cases 1, 2 and 3? Open to discussion.

Continuing the discussion from Single global dev team: integration process, rules to be certified (and thus paid), and rates:

MyriamBoure · March 4, 2018, 5:21pm

I think, like bugs, sysadmin issues will be an ongoing process, so an Epic is not really suited to them. So I would suggest we put the issue directly in the delivery backlog pipe. Then it will go to inception to make sure we understand the issue and that it fits in what we have decided to support. And then go to dev ready. Don’t think we need a project for that.
Actually I think they should just be treated as bugs (we are going to remove the bug epic once we have triaged the existing bugs, and I suggested to @danielle that we do what I just said above). So as you say, an emergency becomes an s1 bug and yes, hit Slack, but we should do the same for any s1 bug.
So for me it should be the same process as bugs. Non-urgent issues could be s2 (a great new feature is not available to one instance, for example) or s3 (something weird on the server that does not prevent use, but could turn into something bad…).
If we agree on that we should update a bit the wiki page on bug severity to reflect that as well.

I am also definitely for option 2 but agree that it might be hard to start with, OR we need to map each instance set-up somewhere, like in a wiki, so that any sys admin intervening can know which case they are dealing with… and most likely the local sys admin who already knows the case will work on it. So all in all I think we could go into option 2 right away. It might also make it easier to set up local surroundings: in France we want to set up a wordpress like in UK, but @paco had some trouble with the way everything was set up at the very beginning, and could use some help from others to set up the wordpress, for instance. This support time should be taken from the global budget as well IMO for contributing instances. I think it’s more resilient for the community if various sysadmins know the set-up of various instances… if something happens to the local person in charge, others can maintain it.

tschumilas · March 5, 2018, 3:32am

This will sound like ‘tough love’ but I’m feeling it a bit so here it goes… In Canada the money we have to pay for sysadmin didn’t just fall out of the sky. We’ve written 9 proposals in the last 18 months. So - really, I’d like to say that the sysadmin support to instances without money needs to be carefully looked at. I think it makes total sense to offer initial support, but there needs to be a limit to this. One of the things that helped me so much, was that when we wanted to launch the OFN-CAN instance, Aus folks told me how much money I needed. So maybe we should do this for new instances early on in the process. So like - ‘we can offer some casual support for 3-6 months, but you either need a sys admin, or you need $7000’ or whatever to cover your first year (or something like that).

Second - it would help me a lot to know what my instance budget is. I can’t quite get a handle on it - I just keep writing proposals and grabbing what I can - but really, I should know what it costs us to keep OFN-CAN operational. This is info I can keep in front of users, members and funders here. (People seem to think it’s all free for some ridiculous reason.) So - knowing annual sysadmin costs is part of this. So once we sort out the above (which all sounds great to me) I’d like a kind of memo (gosh I sound like a bureaucrat) that tells me how many sysadmin hours I’ve purchased and how many months, in our experience, that will last. Then I can set an OFN-CAN operating budget clearly.

Kirsten · March 9, 2018, 5:12am

Hi everyone, so @danielle and I just had a chat about this and here’s the plan

Seems like many relevant issues are already filed on ofn-install, so it makes sense to manage this as a ‘delivery pipe’ on ofn-install.

We’re proposing that we then label things using:

s1 - someone can’t use OFN / critical / please help e.g. UK current problem
s2 - not totally preventing use but is pretty bad
s3 - s5 as you go
upgrade - someone wants an upgrade
setup - someone wants a new OFN deploy
unfunded - someone wants any of the above but isn’t $$ contributor
broadly speaking s3-s5 are unfunded and likely not done, unless someone wants to as volunteer. Same for anything labelled unfunded

We can then mimic openfoodnetwork pipe, with a project for ‘pipe curation’ and epics for things like ‘standardisation’ (our lead global project in this space at the moment)

Then things that are s1, s2, and basic obvious things like upgrade or deploy (funded) can just be picked off and done by devs.
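
The labelling rules above can be read as a small decision function. This is only a sketch of that reading: the label names come from the list above, while the function name `triage`, the return strings, and the choice that `unfunded` overrides everything else are assumptions made for illustration.

```python
# Hypothetical sketch of the proposed pick-up rules; not part of ofn-install.
PICKABLE_SEVERITIES = {"s1", "s2"}
LOW_SEVERITIES = {"s3", "s4", "s5"}
INSTANCE_REQUESTS = {"upgrade", "setup"}


def triage(labels):
    """Return how an issue would be handled, given its labels."""
    labels = {label.lower() for label in labels}
    # Assumption: 'unfunded' and s3-s5 both mean "likely not done unless volunteered".
    if "unfunded" in labels or labels & LOW_SEVERITIES:
        return "volunteer only"
    # s1, s2 and funded upgrade/setup requests can just be picked off by devs.
    if labels & PICKABLE_SEVERITIES or labels & INSTANCE_REQUESTS:
        return "pick up"
    # Anything else would go through curation first.
    return "needs curation"


print(triage(["s1"]))
print(triage(["upgrade", "unfunded"]))
```

Whether an unfunded s1 should really wait for a volunteer is a policy question the rules above leave open; the sketch just picks one reading.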

Other more ‘out there’ things like @MyriamBoure’s example would need to be considered in product curation. Does global team prioritise support for setting up wordpress or not? Is this actually a feature / project that should be considered in the ‘standardisation’ system, or should someone who knows how to do it just have been allocated in the first place (potentially improving set-up and documentation as they go and/or pairing to teach someone else)

If we go straight to Option 2, then I think we likely need to include instance tag on the time tracked in project ‘sys admin - support’ so that we know which instances are getting how much time, and people can keep track of this as per Theresa’s request, and our desire also probably to know what’s happening to aus contributed ‘sys admin’ time.
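
As a sketch of what the instance tag would make possible, the snippet below totals tracked hours per instance. The entry shape (plain dicts with `project`, `instance` and `hours` keys) and the sample figures are invented for illustration and are not Toggl's export format; only the project name ‘sys admin - support’ comes from the text above.

```python
from collections import defaultdict


def hours_by_instance(entries, project="sys admin - support"):
    """Sum tracked hours per instance tag for one project."""
    totals = defaultdict(float)
    for entry in entries:
        if entry["project"] != project:
            continue
        # Entries without an instance tag are grouped so they stay visible.
        totals[entry.get("instance", "untagged")] += entry["hours"]
    return dict(totals)


# Made-up sample entries, for illustration only.
entries = [
    {"project": "sys admin - support", "instance": "can", "hours": 1.5},
    {"project": "sys admin - support", "instance": "aus", "hours": 2.0},
    {"project": "sys admin - support", "instance": "can", "hours": 0.5},
    {"project": "sys admin - standardisation", "hours": 3.0},
    {"project": "sys admin - support", "hours": 1.0},
]

print(hours_by_instance(entries))
```

Totals like these are what a per-instance memo of hours purchased versus hours used could be built from.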

Kirsten · March 9, 2018, 5:47am

So I’m copying the bug severity guide into the ofn-install wiki and have a couple of questions...

not sure if we need a ‘triage’ process, or people can just post to slack #dev if they want to question / change / discuss a proposed severity?
would deployment / sys admin examples be useful?

I have created a new zenhub pipeline called ‘instance requests’ for separating the upgrade and setup requests from the more techie issues

I have created an epic for Sys Admin Standardisation and added issues that seemed related to the seed data and delayed job problems for Q1. Please add or remove any where I have got the wrong ones

Could you please use the toggl projects ‘sys admin - standardisation’ and ‘sys admin - support’ for tracking time @sauloperez @enricostn @maikel @oeoeaio and ping me if that is insufficient / doesn’t make sense

sauloperez · March 9, 2018, 12:16pm

I have created an epic for Sys Admin Standardisation and added issues that seemed related to the seed data and delayed job problems for Q1

We all know that Q1 is coming to an end, right? :trollface: Jokes aside, I’m going to update all the Toggl entries that I previously assigned to General Dev when working on seed data.

MyriamBoure · March 13, 2018, 5:16pm

That sounds all really good @Kirsten

@sauloperez as you have the tech ability can you review the examples shown at https://github.com/openfoodfoundation/ofn-install/wiki/Sys-Admin-Issue-Severity ? I reckon we might not need 5 levels of severity for ofn-install, do we? If we do, please provide one example for each. I tried some things on s2 and s3 but sincerely I shouldn’t have jajajaja

I think it would be great though to agree on and write down in our “super-admin user guide” what support you can expect and under which conditions.

Here is a proposal:

If you are an existing instance and contribute reasonably to the global pool, you are entitled to receive support from the global sys admin team. Your use of that service will be tracked to ensure the general balance and fairness of the process. Your requests will also be prioritized, as the team has a lot of things to work on. So obviously emergencies, like a server being down, or upgrades will be done first. More custom requests, like setting up a local wordpress on your servers, might not be considered a priority, but if you find someone to do it you can ask the sys admin team for support.
If you are a new instance and have funding to get support to set you up and maintain your system, you are also entitled to receive support from the global sys admin team under the same conditions as above.
If you don’t contribute at all to the global pool, or only in a very symbolic proportion, we cannot commit to any support. If you are very engaged in OFN and contributing in other ways we might decide to help you, and of course we can at least answer your questions and provide some advice. But we highly encourage you to find funds to be able to get some proper support.

What do you think?

sauloperez · March 15, 2018, 10:34am

I took a look at the sys admin severities and removed most of the examples, as we don’t even have monitoring. We can iterate on it once new bugs come up and we see how they fit into the severities.

As for the support, that proposal seems fair.

MyriamBoure · April 5, 2018, 3:49am

Ok, added that page to super-admin guidebook in preparation: https://ofn-user-guide.gitbooks.io/ofn-super-admin-guide/content/deployment-and-system-administration.html

sigmundpetersen · September 26, 2018, 1:32pm

@Kirsten could we add this https://docs.google.com/document/u/3/d/1dR2fZp7a9bUp_igzsu13O6C3cKGAi_vVf90Jqy1OX6g/edit to the global drive for future reference?

Kirsten · October 5, 2018, 4:15am

is now in OFN Global/ Strategy - https://docs.google.com/document/d/1dR2fZp7a9bUp_igzsu13O6C3cKGAi_vVf90Jqy1OX6g/edit