OFN Australia adopting the common release cycle


#1

Following some discussion on Slack, we reviewed our deployment practice. We recognised that a lot of things changed within the last year and decided to deploy releases like everybody else.

Where we come from

When the Australian team was the only one, we were on our own journey to find the best delivery process for us. In the beginning we would merge pull requests and then do a big deploy from time to time. There was always the question: Should we push it? Do we risk it? And then we would play Salt’n’Pepa’s Push It while running the deploy process to the production server. Sometimes we did that on a Friday afternoon and regretted it, because we had to quickly fix things after hours or at the weekend while having other plans. So we came up with a rule: no deploys close to finishing work.

Finding out what went wrong was particularly difficult with big pushes. Which change broke it? So we worked on continuous delivery. We set up a delivery pipeline in Buildkite:

  1. Run all automated tests. We had our own server at first and then switched to Travis.
  2. Review the pull request. One review by a peer developer was enough. It was Rohan, Rob and me at the time.
  3. Have the pull request be in sync with master. We rebased them ourselves, but the CI server would merge master into it before staging as well.
  4. Stage the pull request.
  5. Let Sally thoroughly test it. She would not just test the feature, but also test all areas that could be affected and always test login and checkout to prevent any S1 reaching production.
  6. Merge the pull request into master. The script would sanity check that the pull request is still up-to-date and nothing else has been merged into master since staging.
  7. The new master would be deployed on production automatically.

Since this was modelled in the Buildkite pipeline, there was no way around it. It was the only way of doing things and the CI scripts would enforce this process.

We would test, merge and deploy several times per day. Sally had become an excellent tester by that time and we were super confident. New bugs in production were very rare.

Once there was an international community, we started creating releases. They were really just a way to broadcast the changes and remind people to update their servers. They had no direct use for us.

The changes within the last year

After our gathering last year we changed a few things to open the pipeline for the new international team, random contributions and a quick delivery for everybody.

  • We abandoned the strict use of our Buildkite pipeline to enable other people to merge without being coupled to our deploys.
  • Anyone in the core dev team can merge a pull request via Github.
  • More people started testing and we tried to formalise the process and transfer testing knowledge.
  • Sally, our testing guru, our only local tester, is leaving.
  • The number of pull requests increased.
  • Releases are created by the dev team and they became more frequent.
  • The Aus dev team is smaller (just me).
  • I now deploy the master branch to production several times a week, usually including multiple pull requests (not at the end of the day).
  • We introduced release testing.
  • We skip basic release testing on every pull request. (Right?)

I’m not sure about the extent of the pull request testing at the moment. I’m not sure if it was a deliberate decision to skip more general testing for every pull request, if that was just lost in the transfer of testing knowledge, or if we still do it. Can you answer that, @MyriamBoure?

Our planned changes in Aus

Beginning 2019, we would like to deploy releases like everybody else. This means that we won’t be the guinea pig any more and will behave like the rest of the global team. While we still believe that frequent deploys minimise the risk of introducing multiple bugs at once and simplify finding the bug, we also recognise the new risks of doing things differently to the rest of the team and of not having everything tested by Sally any more. The fact that there are release tests implies that master is usually not tested for production use any more. It is also not feasible to deploy one pull request at a time without a new way of automation.

This means that we will all be on the same page and pull in the same direction together to improve our release and deployment process. :muscle:

My vision

As you can hopefully see from this post, our practice of continuous delivery was the result of a lesson learned the hard way. And I still think it is the best way to minimise risk and we should try to find a new way to make it possible. We need to compromise now to build a team, but I hope we won’t repeat the mistakes from the past.

Having a global team is a problem that made our old process infeasible, but it’s also the solution. When pull requests are merged while I’m asleep, Europeans can actually deploy them to our server and minimise the risk of affecting prime shopping time.

I would like us to speed up the release cycle to a point that every new pull request merged into master is a new release. The pull request testing and release testing would come together again. We then need a new way of deploying this to several servers in an iterative way to minimise risk.

Just an idea. Imagine a pull request is merged in Europe. You can deploy to Australia while everybody is asleep there. If something like a database migration fails, you have plenty of time to fix it. If everything is fine, maybe after an hour, you can deploy to an American server. People there may just be waking up. We can work our way through the time zones, deploying one server at a time, one pull request at a time.
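A minimal sketch of that follow-the-sun idea, with hypothetical server names and an injectable deploy command and health check (the real ones would come from ofn-install and whatever monitoring we use):

```python
import time

# Hypothetical rollout order, easternmost first; real names would
# come from the ofn-install inventory.
ROLLOUT = ["au-prod", "uk-prod", "fr-prod", "ca-prod", "us-prod"]

def rolling_deploy(servers, deploy, healthy, soak_seconds=3600, sleep=time.sleep):
    """Deploy one server at a time; stop as soon as one looks unhealthy.

    `deploy(server)` and `healthy(server)` are placeholders for the
    actual deploy command and monitoring check. Returns the servers
    that were deployed and confirmed healthy.
    """
    done = []
    for server in servers:
        deploy(server)
        sleep(soak_seconds)        # let alerts fire before moving on
        if not healthy(server):
            return done            # halt the rollout; fix or roll back first
        done.append(server)
    return done
```

The soak time and the halt-on-failure behaviour are the interesting knobs here: an hour per server gives alerts time to fire before the next region is touched, which is exactly the "plenty of time to fix it" window described above.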


#2

I love the idea of turning a challenge (time zones) into a strength!


#3

thanks @maikel for describing this process in detail!
I agree with your vision here.

On top of your suggestions here, and I think in line with what you describe, in today’s global meeting we discussed the possibility of starting to adopt continuous deployment for all instances, i.e. we could start deploying a given release (which, with your vision, will become very frequent) to all instances at the same time. This way we save lots of deployment time for everyone and we will become a lot better at rolling back releases in case there are problems. What do you think about this @maikel @sauloperez?


#4
  • continuous deployment: sounds great
  • more frequent releases: awesome
  • deploying all servers at the same time: what about reducing the impact if there is a bug? The rollback provided by ofn-install is a bad solution, because it loses data. But I agree that deploying shouldn’t be the bottleneck. And we don’t have any acceptance criteria for a successful deploy other than ofn-install not reporting errors. We could wait a few minutes to see if any alerts are triggered. Yes, let’s do it!

I would suggest a server list sorted by time zones to deploy the releases to. The person doing the release can start deploying as well. They know what changed, because they wrote the release notes.


#5

Great that you raise this topic @maikel! I think we all want to improve and make this process quicker.

The rollback provided by ofn-install is a bad solution, because it loses data

Sure, we have proven that the rollback doesn’t work. That’s what happens with rollbacks; they never work until you use them. Simple solution: let’s fix it. It’s not rocket science.

The only tough point I see is the release testing for each PR/small release. Doesn’t it take too much time for every PR we want to merge? Can you share details of what you used to test in AUS and how long it took? Sorry, I have limited experience in software delivery approaches.

I also find the app monitoring initiative quite related to this. When we have it in production it’ll quickly tell us if a release needs to be rolled back. It’ll show us whether all integrations continue working or not.

Overall, I say let’s do it! I don’t mind breaking things as long as we react quickly. I think it’s the only way to improve the product much faster, which will make us and our users happier.

Shall we create issues to:

  • Fix ofn-install’s rollback
  • Group hosts by timezone so that they can be deployed together (saves precious dev time!). I see the following ones:
    • UK, FR, KA, BE
    • AUS
    • CAN, USA
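For illustration, a grouping like that could even be derived from rough UTC offsets instead of being maintained by hand. The offsets below ignore DST, the tolerance is arbitrary, and the instance names are just the ones listed above:

```python
# Rough UTC offsets per instance (ignoring DST); purely illustrative.
INSTANCE_OFFSETS = {
    "AUS": 10,
    "UK": 0, "FR": 1, "KA": 1, "BE": 1,
    "CAN": -5, "USA": -5,
}

def deploy_groups(offsets, tolerance=2):
    """Cluster instances into deploy groups, easternmost first, so a
    release made in Europe reaches each region during its local night.
    Instances within `tolerance` hours of a group's first member join
    that group."""
    ordered = sorted(offsets, key=offsets.get, reverse=True)
    groups = []
    for name in ordered:
        if groups and offsets[groups[-1][0]] - offsets[name] <= tolerance:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups
```

With the offsets above this reproduces the three suggested groups, in deploy order.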

#6

I’m not sure if it’s broken. I meant that it’s not a good solution. The deployment creates a backup with all files and a database snapshot. If we do a rollback, it moves the files back and also reverts the database to the original state. But I see that as a problem, because all transactions between deployment and rollback are lost. It’s a destructive action that should only be done if there is no other option. Maybe we need a deployment emergency plan like this:

  • If there haven’t been any migrations: deploy the previous commit.
  • If there have been migrations: First rollback migrations, then deploy previous commit.
  • Can you fix it manually?
  • If the damage is bigger than losing all transactions since deployment, then perform a complete rollback.
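The ladder above could be encoded as a small decision helper. The precedence (manual fix first, full rollback last) is my reading of the list, and `rake db:rollback` stands in for however we actually revert migrations:

```python
def recovery_plan(migrations_ran, manually_fixable, damage_exceeds_lost_transactions):
    """Pick a recovery action for a bad deploy, preferring the least
    destructive option; mirrors the emergency plan sketched above."""
    if manually_fixable:
        return "hotfix in place"
    if damage_exceeds_lost_transactions:
        return "full rollback (files + database snapshot)"
    if migrations_ran:
        return "rake db:rollback, then deploy previous commit"
    return "deploy previous commit"
```

The point of writing it down like this is that the destructive full rollback only ever fires when the damage is explicitly judged worse than losing the transactions since the deploy.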

@sstead Have you documented which additional sanity checks you do with every pull request? I remember that you always tried to log in and check out. Anything else? I guess there was some intuition as well. You would test things that caused trouble recently.

Let’s start documenting the deployment status so that we can use it for coordinated deployments.


#7

As far as my experience goes, I would let rollbacks deal with app code and leave the DB rollback to devs. 90% of the time we won’t need to roll the DB back; the rest of the time it’ll be about reverting the last migration, not to mention there are practices for DB schema migrations that mitigate these problems.

I’m fine with the plan you suggest. If we separate db rollbacks from app rollbacks then, instead of deploying the previous commit we can simply execute the rollback. That should be way faster as it’s just changing the current symlink to point to the previous deploy’s folder. Then, we can focus on what needs to happen at db-level.

I like the idea of knowing the status of each instance, but I think it’ll be easier to keep up-to-date if we implement Display software version once and for all. There have already been a few times I wished we had it. It’s seriously a 15min thing: have an endpoint return the metadata we need beyond the release version, aggregate it all in an HTML table and there you have it. No human intervention.
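A sketch of the aggregation half of that idea, assuming each instance exposes some status metadata. The field names here are invented, and fetching the data over HTTP per instance is left out:

```python
def status_table(instances):
    """Render per-instance metadata as an HTML table, one row per
    server. `instances` maps instance name -> metadata dict; the
    'version' and 'deployed_at' keys are hypothetical examples of
    what a status endpoint could return."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{meta.get('version', '?')}</td>"
        f"<td>{meta.get('deployed_at', '?')}</td></tr>"
        for name, meta in instances.items()
    )
    return ("<table><tr><th>Instance</th><th>Version</th>"
            f"<th>Deployed</th></tr>{rows}</table>")
```

Point a cron job at every instance’s status endpoint, feed the results into this, and the table maintains itself.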

BTW I used to use https://capistranorb.com/documentation/getting-started/rollbacks/ back in the day. It might be worth checking for reference.


#8

My idea for the deployment status page is not just to display the version of each instance. It’s also a todo-list for a person deploying all the servers. And the servers are already listed there in time zone order. So it’s a tool like a checklist to manage the deploys and make them visible.


#9

First paragraph:

In the majority of failed deployment situations, it probably makes more sense to revert the bad code and redeploy, rather than running deploy:rollback. Capistrano provides basic rollback support, but as each application and system handles rollbacks differently, it is up to the individual to test and validate that rollback behaves correctly for their use case. For example, capistrano-rails will run special tasks on rollback to fix the assets, but does nothing special with database migrations.

That’s exactly what I meant.


#10

Probably @Rachel would be better placed than me to answer… I like the idea of continuous deployment, and in that sense release testing doesn’t make much sense. When we do a production upgrade, we usually (we didn’t last time…) do a sanity check, which is:
1- I can log in
2- I can create a product and it appears correctly in the shop as supposed (so check the OC quickly)
3- I can checkout
4- I see the order on the order view

I think it’s 5 minutes to check that, so we could do it for each PR if we move in that direction. And, as you say, try to test some more “intuitive” potential impacts on top of the pure “what to test” description… intuition can only come with experience and time dedicated to testing (a tester in a hurry will maybe not take the time to pause and think about what else could be impacted…).
So for new testers it would be good to have a Ha or Ri tester review the test, because she would think of some potential impacts to test that the Shu tester would not think about… @Rachel, I’m not sure we did that so much with you, maybe we should have! How do you feel about this “intuition” on what can be impacted? I think I remember some conversations where you disagreed about the fact that we should try to think of every case the PR could affect… but it is also a fact that we had more bugs coming through recently, and even if it’s probably multifactorial, I think Sally was doing more extensive “intuition-based” use case testing…

I don’t know what is best: spend more time on testing, break fewer things, and so release less quickly but more safely, or do less “intuition-based testing”, release without being afraid of bugs, and then fix them quickly… this “testing design principle” is essential to me in this conversation about continuous deployment…

But I don’t have much experience in software release pipe management so your experience will be precious @Rachel, and probably @danielle can have good inputs there as well with her own experience (and her suggestion to hire a professional tester).


#11

Nope, that’s not what I said. Of course introducing new stuff should be tested, and the rest of the platform should remain the same, and we need to make sure of that. But just like the Ri dev working on it, there will never be a Ri tester able to think of everything. This is human nature :slight_smile: So we should have detailed testing processes and run them every time. By detailed I mean with key info such as steps, URL, profile… But even with this there will still be mistakes, because this is too much for a human tester to do on each PR (let’s not forget tests should be run on different browsers, let alone different devices). So we need automated tests for that. I don’t see any other way to improve testing quality.

Sally confirmed on Slack that there never was a sanity check per PR (sorry, I can’t find the post anymore :frowning:). But you were tagged in this discussion I think, @maikel :slight_smile: We introduced regular sanity checks on stagings with release stagings AFAIK. The rest of the sanity checks were done in production, but I’m not even sure every instance is aware that they should do that :confused:

Well… about that… we had a pretty annoying week, true. Is this still a fact? I would love to improve what I’m testing, but I have no fact-based info where I can see that a specific bug was introduced by a specific PR that was not “extensively” tested, and therefore know what I was missing in testing. The only one I can think of is the Instagram links, because yes, I went too quick on that one and forgot the front office.
So unless there is, the only solution to me is documenting testing processes so that A. every tester knows what to do in each case (and you don’t need to be Ri to do that; if there is a process, anyone can follow it) and B. we can cover our app with more automated tests.
And yes, like I said on Slack, a professional tester could be very helpful for that.


#12

Thank you @Rachel and @MyriamBoure for all that input. We do have a lot of automated tests. Login, checkout, orders page, all that is tested automatically. A detailed testing manual can just be converted into an automated test. Computers are good at following instructions. But there are some things that our tests don’t cover:

  • Layout issues. It looks ugly, may look unusable, but all text is still there and you can use it. The solution to this would be visual acceptance testing. One tool for that is https://percy.io/.
  • Real communication with external services like Stripe. The solution to this would be to create automated tests for a staging server. A good tool for that is Selenium. @Rachel Have you worked with it? You can actually record a testing session and replay your clicks and form fills for any pull request.
  • Complex real world data. The space of possible data is too huge to be generated in an automated way or for a test environment. We could test with every production database and would still not be able to catch a bug that appears with data entered by a user tomorrow. I don’t know of any reasonable solution to this problem. We can just try to make our test data more and more complex and we can run database migrations with production data. But there will always be bugs in production, we have to live with that.
  • Using context of recent events to try different edge cases: intuition based testing. Computers are not smart enough to compete with humans on this level.
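The record-and-replay idea above doesn’t have to be tied to Selenium’s API: a recorded session is just a list of steps, and a runner replays them against whatever driver is plugged in. Everything below (the step format, the selectors, the driver interface) is invented for illustration; with Selenium you’d wrap a real webdriver in the same interface:

```python
# A recorded checkout session as plain data. Selectors are invented
# examples, not real OFN markup.
CHECKOUT_STEPS = [
    ("visit", "/shop", None),
    ("fill", "#customer_email", "test@example.com"),
    ("click", "#add-to-cart", None),
    ("click", "#checkout", None),
]

def replay(steps, driver):
    """Replay recorded steps against a driver exposing visit/fill/click
    methods (e.g. a thin wrapper around a Selenium webdriver)."""
    for action, target, value in steps:
        method = getattr(driver, action)
        method(target) if value is None else method(target, value)

class RecordingDriver:
    """Stand-in driver that just records calls; useful for testing the
    replay logic without a browser."""
    def __init__(self):
        self.log = []
    def visit(self, url): self.log.append(("visit", url))
    def fill(self, sel, text): self.log.append(("fill", sel, text))
    def click(self, sel): self.log.append(("click", sel))
```

Keeping the session as data means testers could record or edit scenarios without touching the runner.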

These are some ideas, but I think that our testing process is pretty good already. I agree with Rachel that most of the recent bugs have not been introduced due to bad testing. They appeared due to more complex production data which we will never be able to cover completely. It’s growth pain.

@Rachel How long does release testing take at the moment? Do you agree with the 5 minute assessment from Myriam?


#13

Great summary @maikel! It expresses what I was trying to say in a more comprehensive way :slight_smile: I haven’t worked with Selenium yet but I’ve seen demos :heart_eyes:

About release testing: what Myriam was referring to was production testing. Release testing is heavier, with more use cases; see the example here: https://docs.google.com/document/d/1NjxmE11lA2z_JJ0kwSZvfVudliik3kEoMMUw-tU_j7g/edit#heading=h.4ctypvfojpgm

It took me at least one hour, and I didn’t test everything; Myriam and Sally took over some parts that I didn’t know well (reports e.g.).


#14

Wow, that is huge. I would like to make two suggestions.

  1. We should review the release testing template and find the scenarios that are covered by automated testing. We don’t need to test those manually.
  2. I would like to avoid the double-checking of all the merged pull requests. But that needs a little modification of our process. After a tester gives their okay, we just merge the pull request into master. We developers do that anyway; there is no other assessment we do, we just merge it. So we can automate that or give testers permission to merge. And if every successfully tested pull request is in master, we can merge master into every pull request when staging and it will be up to date. It is then very unlikely that a pull request breaks an existing feature.

#15

I really like those two suggestions @maikel :slight_smile: let’s see what others think about this!


#16

I was talking about production testing, yes. So if every PR became a release, we would do a similar production-like sanity check IMO.

One thing I don’t understand, @maikel, is that if there is continuous deployment (every PR merged is a release), there is no release testing any more, only PR testing, so I don’t understand your point 1. We will always want to do a sanity check of checkout even if there are automated tests… so we would still do a manual sanity check for each PR.
I agree that double-checking all PRs is not ideal (and doesn’t make sense if every PR is a release!), but @Rachel’s idea was that testing them all together can reveal bugs.

I’m wondering one thing: rather than continuous deployment (every PR is a release), shouldn’t we in our case have regular releases like we do now, but test the release on a “pre-production server”, as is usual in many software projects? That could be a UK/Aus or FR production-like server, for instance, to have real data and reveal bugs when we test the release.

If we do continuous deployment, then I don’t understand why we wouldn’t all be directly on master. If we want every new PR to be directly a release, it would avoid the upgrade process… like Aus actually was. If there is an issue with a PR, we revert the last PR… for me the evolution toward ofn-install was to enable quick releases and upgrades, but that seems kind of contrary to continuous deployment (if I understand well). Is it adapted to continuous deployment like you suggest, Maikel, i.e. each PR becomes a release?

Maybe I misunderstood some bits, so please correct me, but what I understood from the last hangout discussion was that we could have quick and regular releases as we do now, with 2/3 groups of simultaneous upgrades (by time zone). But I think it might be too much of a mess to go toward “each PR is a release”… adding release testing on a pre-production server could avoid the “guinea pig” effect…


#17

@MyriamBoure there are a few things that I don’t understand in your proposition:

That is exactly what we are doing now: we test the release on a staging server. What are you suggesting we should change or add to that?

Maikel’s suggestion number 2 avoids this problem: the proposal is to stage each PR with the most up-to-date master. So, except for the last PR of a release, we would have already tested them together. That would save us a great deal of time and also increase our testing quality.

IMO you have to separate the discussion on automated tests (Maikel’s point 1) from continuous deployment. Here point 1 is about increasing testing quality. You can have automated tests and continuous deployment, or automated tests and releases.


#18

Agree with @Rachel. I feel we’re talking about several things here, and although we might want them all, we can’t have them all at the same time.

So to me, continuous deployment is something I’d like to aim for next year, but first we need to improve the current process. So let’s put continuous deployment aside for now.

I think what @MyriamBoure means by a pre-production server is a server that uses production’s DB, so that we test with the same complex data.

I broadly agree with @maikel’s ideas so let’s turn this discussion into actionable issues that allow us to iterate.


#19

I was wondering if we can make our PR testing good enough for releases. The process would be to make release testing more efficient and to fold more of it into PR testing until they are the same.

Very well. It sounds reasonable to me to test the checkout as the most business-critical feature. For example, the production and staging environments communicate with Stripe, while the tests simulate the communication with Stripe. But we don’t need to test five different enterprise fees manually, because that should be tested automatically, and the communication with Stripe is not affected by the type of enterprise fee.

I think it’s good to review the manual testing from time to time and identify parts that can be automated so that they don’t need to be tested manually. And I agree that there will always be some parts of integration testing that can’t be covered by automated tests.

As Rachel asked: what is the difference to a staging server? I think you are saying that our staging servers don’t have such complex data, and that you would like production data on a staging server. That is almost possible. The process of copying the data from production to staging involves some changes, though. Secrets like Stripe keys need to be replaced or removed so that we are not transferring money into real accounts. The email setup needs to be changed so that we are not sending emails to real people. We need to put the application in a sandbox to test with production data without affecting the real world. That’s a medium-sized project.
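The sandboxing step could start as simply as rewriting the dangerous settings when a production copy is imported. The setting names below are invented examples, not OFN’s actual configuration keys, and MailCatcher is just one option for capturing outgoing mail:

```python
def sanitize_for_staging(settings):
    """Return a copy of production settings that is safe for a staging
    server: payment keys swapped for test keys, outgoing mail pointed
    at a local catcher instead of the real SMTP server."""
    safe = dict(settings)  # never mutate the production settings
    if "stripe_secret_key" in safe:
        safe["stripe_secret_key"] = "sk_test_placeholder"
    safe["smtp_host"] = "localhost"      # e.g. a local MailCatcher
    safe["smtp_port"] = 1025
    safe["email_intercept"] = "staging@example.com"
    return safe
```

A script like this would run as the last step of every production-to-staging copy, so the sandbox can never silently inherit live credentials.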

We would. With continuous deployment, master is always the latest release. We can give it a name or leave it.

Ofn-install and our custom deploy script are just two ways to update a server to a specific version of the code. Both work and are efficient enough for continuous deployment. With either of them you can deploy master or a release. We invested work in ofn-install so that it works for all instances, is more reliable and we can share the sysadmin work globally.

Talking to the other core developers, we agree that we would like to work towards continuous deployment. But that’s a slow process, because we need to clarify the impacts on the whole process and adapt our testing and deployment methods. We changed a lot within the last year and it still feels like we are experimenting with our processes. We need to run the current model for a few cycles to identify what works and what doesn’t, and then iterate. We are trying to become more efficient and make more and smaller releases. And one day the smallest release will contain only one pull request.


#20

Maikel, your 2 suggestions above are awesome! Especially 2. I think we should go for it. I agree with everything you are saying in this thread; not much I can add.

re point 1. I wouldn’t worry about sanity checks in staging; I don’t see much value in repeating the automated checkout test in staging… how probable is it to have a green build with a broken checkout in staging? Otherwise, I think testing integrations in staging and sanity checks in production are important.

re point 2. we should stop staging PRs without merging master into them (AUS: Semaphore staging is merging master into the PR; UK: I am not sure the UK staging process is merging master; ES/FR: not merging master). We can do this as we move all staging processes to Semaphore.