Have a Maintenance Mode

makeitgreat

#1

Following the recent server update in France, I would like to propose having a Maintenance Mode for whenever a change is deployed to a production server.

I know that future releases should go smoothly but… we never know :) It would also keep users off the platform during sysadmin activity, so that we don’t end up with weird bugs whose cause we can’t trace.

How I see it working:

  1. When maintenance is planned, each instance’s super admin manually notifies its users, a week or a couple of days ahead, that the maintenance will take place at a particular time.
  2. At that time, the super admin switches the platform to Maintenance Mode. Ideally the switch would be available on the main configuration page, as an ON/OFF toggle.
  3. Once Maintenance Mode is on, each logged-in user is signed out and every request to a production URL of that instance is redirected to a page like openfoodnetworkinstance.org/maintenance. Some examples that I like:
    • Just a simple message with a fun picture, from a major retailer in France (I know it’s for a 404 error, but you get the idea^^). The message says “Don’t panic, our team is on it”.
    • A message plus an email notification when the site is back up.
  4. During Maintenance Mode, super admins should still be able to log in (IP restrictions?) in order to run safety tests.
  5. When the tests are all green, the super admins switch Maintenance Mode back OFF.
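
To make steps 3 to 5 concrete, here is a rough sketch of the decision each request would go through. It is only an illustration in Python, not OFN code: `Request`, `Response`, `handle` and `MAINTENANCE_PATH` are made-up names, the 503 status for the maintenance page is my assumption, and how a super admin is actually recognised (step 4) is left open.

```python
from dataclasses import dataclass
from typing import Optional

MAINTENANCE_PATH = "/maintenance"  # hypothetical path, modelled on step 3


@dataclass
class Request:
    path: str
    user: Optional[str] = None      # None means nobody is logged in
    is_super_admin: bool = False    # how this is determined is left open


@dataclass
class Response:
    status: int
    location: Optional[str] = None  # set only for redirects
    sign_out: bool = False          # True if the session should be cleared


def handle(request: Request, maintenance_on: bool) -> Response:
    """Decide what happens to a request under the proposed Maintenance Mode."""
    if not maintenance_on:
        return Response(status=200)
    if request.is_super_admin:
        # Step 4: super admins keep access so they can run their tests.
        return Response(status=200)
    if request.path == MAINTENANCE_PATH:
        # Serve the maintenance page itself, otherwise we would redirect forever.
        # 503 (Service Unavailable) is an assumption, not part of the proposal.
        return Response(status=503)
    # Step 3: sign out logged-in users and send everyone to the maintenance page.
    return Response(status=302, location=MAINTENANCE_PATH,
                    sign_out=request.user is not None)
```

For example, `handle(Request("/shop", user="ann"), maintenance_on=True)` gives a redirect to `/maintenance` with `sign_out` set, while the same request with Maintenance Mode off passes through untouched.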

Cherry on the cake: in addition to emailing our users about the maintenance, we could also show a banner on the platform telling them that maintenance will occur and when (maybe only on the admin side).

What do you think? I would love a dev point of view on whether this is doable on OFN. cc @maikel @sauloperez @Matt-Yorkley @luisramos0 @Hugs


#2

There are a few different features here that should be looked at separately. Together they form a great bundle of tools.

  • A maintenance announcement banner. I think that is very useful to set expectations for every user visiting the site before and during the maintenance window. It should mention the time frame. It could also activate itself at a certain time, but it should be deactivated manually when everything is done.
  • An announcement sent out via email. That should only go to enterprise users.
  • A switch to deactivate certain features like sign up, checkout etc. during maintenance. Read access, like viewing contact details of enterprises, should be no problem most of the time.
  • Allow admins to test the site. I think that an IP address is a very bad way to identify a user. We could have a special URL like openfoodnetwork.org.au/test_maintenance. Visiting that URL could set a cookie which deactivates the restrictions for that user.
  • We usually need to shut down the application for around 5 minutes. In that time we can only display a static page. Creating a nice looking page here would actually be a quick win we could do very easily. We just need to copy it into the right directory with the right filename. It would be a bit more work to make this page editable, but customising it would make a lot of sense.
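
The feature switch and the cookie bypass from the list above could look roughly like this. Again just a Python sketch under assumptions: the cookie name, the restricted path prefixes and the function names are all hypothetical, and only the `/test_maintenance` URL comes from the suggestion itself.

```python
BYPASS_PATH = "/test_maintenance"     # special URL suggested above
BYPASS_COOKIE = "maintenance_bypass"  # hypothetical cookie name
RESTRICTED_PREFIXES = ("/signup", "/checkout")  # hypothetical write paths


def is_allowed(path: str, cookies: dict, maintenance_on: bool) -> bool:
    """Return True if the request may proceed during maintenance."""
    if not maintenance_on:
        return True
    if cookies.get(BYPASS_COOKIE) == "1":
        return True  # this visitor opted in via BYPASS_PATH
    # Read access stays open; only the restricted features are switched off.
    return not path.startswith(RESTRICTED_PREFIXES)


def cookies_to_set(path: str) -> dict:
    """Visiting the special URL hands out the bypass cookie."""
    return {BYPASS_COOKIE: "1"} if path == BYPASS_PATH else {}
```

As written, anyone who knows the URL gets the bypass. If that matters, the cookie would probably need to be signed, or the URL limited to logged-in admins.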

#3

I agree overall with @maikel’s thoughts.

But I’d also like to add that, to me, maintenance mode should only be enabled sporadically and in well-defined situations, like fixing FR’s production server, major releases like the Spree upgrade, or long downtimes like the one we also experienced in FR this year.

To that end, and given all these are sysadmin responsibilities IMO, I wouldn’t allow enabling it from the admin UI. Not only does that make it harder to implement and maintain, it also grants instance managers powers I believe they shouldn’t have. It’s definitely not just another instance setting.

Finally, I wouldn’t go crazy in terms of features; we already have lots of things on our TODO list and we should strive for simplicity. That’s why I think “A switch to deactivate certain features like sign up” may not be worth the effort. As for “Allow admins to test the site”, although it adds complexity to an otherwise straightforward feature, it may be useful and is worth considering. Let’s not forget we have staging environments though.


#4

I’d vote for very simple solutions here:

  1. This can be done by extracting users’ emails from the system and sending a manual email.
  2 and 3. This is something that can easily be done at the web server config level.
  4. This is not a simple issue in GH…
  5. Can be done at server level.

I’d vote for a very good definition of a manual process that involves describing the sysadmin tasks and the communications to users. After we have this and we have used this manual process for some time, we could consider the development of these features.

I am saying this for 3 reasons:

  • I have seen much bigger systems working well with manual processes
  • I think we should aim for zero downtime, so making downtime painful for everyone is a good thing
  • If you need to “verify things in production” it means you have to fix/improve your dev/release process, could be staging but could also be improving rollback capabilities for example

#5

@luisramos0 @sauloperez @maikel

Thank you for your feedback! I will edit my original post, as it occurred to me that I was not clear enough: of course step 1 is manual and can remain that way. I was just describing how I picture the whole process step by step, not only the automated steps.

Also I fully agree that we should not build something difficult to maintain, my point was just that it is really annoying to have users trying to reach the platform while sysadmins are doing stuff.

I’m not sure we will reach zero downtime soon, and in the meantime our releases are more frequent, and so are our downtimes…

So would Maikel’s idea of a static page being displayed during our 5 min downtimes be a quick win on this?

Or are there other blockers to take into account?


#6

Yes, I’d create that page and then get all sysadmins to know how to activate it. That will be easy to do and very useful.
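
One common way to make such a page easy to activate is a flag file: the page is served for as long as the file exists, and deleting it switches things back. A minimal Python sketch of that logic follows. In a real setup this check would have to live in the web server config, since the application itself is down at that point. The file name `maintenance.html`, the `Retry-After` value and the function name are my assumptions.

```python
import os

FLAG_FILE = "maintenance.html"  # hypothetical file name


def maintenance_middleware(app, flag_path=FLAG_FILE):
    """Wrap a WSGI app: while flag_path exists, serve that file with a 503."""
    def wrapped(environ, start_response):
        if os.path.exists(flag_path):
            with open(flag_path, "rb") as f:
                body = f.read()
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "300"),  # assumption: roughly 5 minutes
            ])
            return [body]
        # No flag file: pass the request through to the application.
        return app(environ, start_response)
    return wrapped
```

Activating the page is then just copying the HTML file into place, and deactivating it is removing the file, which matches the “copy it into the right directory with the right filename” idea.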


#7

A release should not involve any downtime. We had some downtime when we switched servers, but deploying a new release to the same server can happen without downtime. On very active days, we merge several pull requests and deploy each time. That can result in 5 deploys per day without any downtime. I’m not sure if it’s as easy with ofn-install though.


#8

Yes, it is. We haven’t had any downtime due to releases either. It’s all about reloading Unicorn, so it doesn’t matter which tool we use to do that.


#9

OK, so from what I’m reading here: if there is no downtime, do we need a maintenance page at all? If we agree it’s still good to have for the cases where sysadmins need to take the servers down (and there will always be some, even if not often), then I suggest we move this forward as a draft in the icebox, to be prioritized only with the scope “create a page and make sure sysadmins know how to activate it”. I’m not sure it will be prioritized soon, but it could be eligible for voting.


#10

Just to share a case: in Katuma we’ll migrate our subdomain from alpha.katuma.org to app.katuma.org soon (I plan to spend some time on it over the next three weeks) and we’ll need to show the maintenance page, at least for a few minutes. This will be needed for other infrastructure changes in the future, more than for releases. In our particular case, a manually enabled page will do.