One question we often receive at Wrk from fellow startups is, "how were you able to get to a point where you can push your system every day?"

This is a great question to ask. Release engineering (or RelEng) is a problem that all companies face, from the smallest in-the-garage startups to huge corporations like Google. Many outages and bugs occur due to problems in the release process. Establishing a proper process for getting code into production is a great way to avoid those outages and provide a better user experience.

Why?

Before diving into that, we have to ask: Is this the best use of our time? Engineers don't have a lot of time, so it's good to know what something will do for you before you spend time on it.

Benefits of deploying on a rapid release cycle:

You get rapid feedback on changes.

When experimenting with features or technical changes, it's often unclear what the impact will be. For example, you may not know how users will react to a certain change, or whether a certain fix will actually work in production. If you make the change and then have to wait a week to get feedback on it, you're operating on a feedback cycle of weeks rather than days or hours.

Users like to see progress.

At our stage as a company, users are fairly tolerant of bugs. However, that doesn't mean they like them. If they see that bugs get fixed quickly, they feel like you're on top of things, and it gives them more confidence that even if things go wrong, you're acting to make them right. If you can fix these problems in hours or days rather than weeks, it inspires much more confidence.

The cost of fixing bugs stays low.

The cost of fixing a bug increases nonlinearly with the time between when it is introduced and when it is discovered. When I land a pull request, and we see an issue in one of our environments a day or two later, it is often pretty obvious to me what the root cause is. I can whip up a fix quickly and push it out right away. However, if it's been a week or two (or, god forbid, a month or more), I must regain the context before I can even think about fixing the bug.

The cost also increases nonlinearly with the number of changes going out. When you push a lot of changes at once and something goes wrong, you often have to isolate the specific change that caused the problem. On a daily release cycle, only two changes might go out (sometimes just one), so there is no real isolation work to do—you can often figure out which change broke things just by looking at the commit message.

There are a few downsides to pushing this rapidly:

  • It takes investment to put these systems in place, and it either requires discipline to follow the process or limits the flexibility of engineers when it comes to rolling things out (and we know how much engineers hate being limited). When you first get going, it can be much harder to follow a solid process than to keep doing things in the ad-hoc way you did before.

  • You have to build infrastructure to gate feature releases. On a rapid release cycle, you can easily end up rolling out features before the customer support and marketing teams are aware of them. We had to build the ability to turn features on and off into our release process so that we could properly coordinate with other teams.

Those wins significantly outweigh the downsides for us, so we're quite happy with the investment we've made.

How?

So how did we at Wrk get here? What did we do to be able to release reliably at this velocity?

There is no single thing we've done, but rather a set of practices we've put into place that collectively add up. I'll go through each one individually.

Attitude

Before you can do anything, you need to shift your team's attitude. No amount of process will help if your engineers are not on board with what you want to do.

The most important thing to understand on your team is this: humans screw up. Even the best engineers screw up. If your process depends on humans always doing the right thing, you're not going to have stability, and you're going to struggle to get to a rapid-release system (among many other things). While engineers often admit that other people screw up, they must fully understand that they, too, screw up and that a little process goes a long way in preventing those screw-ups from having a serious impact.

Once you've convinced them of that, they'll be more open to the process improvements I'll describe next.

The Coding Process

For our coding process at Wrk, we've put three restrictions in place that help with stability:

  • No one commits directly to master. We've set up GitHub to reject any pushes to master, even forced pushes. If you want to make a change, you create a branch, push it to GitHub, and open a pull request.

  • All code requires static checks and unit tests to pass. These run automatically in CircleCI when the branch is pushed to GitHub, so no human action is needed here. We require things like pylint and mypy to pass before you can merge anything (a rough sketch of this gate is shown below).

  • All pull requests require a code review. Software engineering research going back decades has shown that peer review (code and design review) is the single best thing you can do to improve the quality of code—it has even more of an impact than unit testing or static types. It has the added benefit of spreading knowledge between team members, especially junior engineers or people who aren't completely familiar with the stack you're using.

These controls establish the precondition that the code in master is fairly solid and in a state to push.
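
As a rough illustration of that test gate (this is not our actual CircleCI config, and the package name is a placeholder), the whole thing boils down to a script like this, where any non-zero exit blocks the merge:

```python
# A rough sketch of the pre-merge gate, not our actual CI config.
# "wrk" is a placeholder package name; in CircleCI each tool would
# normally be its own job step, but the effect is the same.
import subprocess
import sys

CHECKS = [
    ["pylint", "wrk"],           # static analysis
    ["mypy", "wrk"],             # type checking
    ["python", "-m", "pytest"],  # unit tests
]

def main() -> int:
    for cmd in CHECKS:
        print("+", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            return 1  # any failing check blocks the merge
    return 0

if __name__ == "__main__":
    sys.exit(main())
```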


The Push Process at Wrk

Getting Started

At Wrk, we generally start a push every morning. This gets triggered by a cron job randomly selecting a team member to do the push. The entire team is part of this rotation, including those who have just started that week. This forces us to have a solid push process because if you need specialized knowledge to push things, you don't have a streamlined, automated process.

A push can start from any commit. Wrk has Wrkflows in CircleCI that kick off for every pull request merged into master, and each one starts with a confirmation step. When the randomly selected victim (the push shepherd) goes into the CircleCI dashboard, they just filter on that workflow and start the latest one. All they have to do is click a button.
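
For what it's worth, the rotation picker can be as simple as a scheduled script like the sketch below; the roster and webhook URL are made up, not our actual setup.

```python
# A hypothetical sketch of the daily rotation picker run by the cron job.
# The roster and the Slack incoming-webhook URL are placeholders.
import os
import random

import requests

TEAM = ["alice", "bob", "carol", "dave"]  # everyone is in the rotation

def pick_and_announce() -> str:
    shepherd = random.choice(TEAM)
    webhook = os.environ.get("PUSH_WEBHOOK_URL")  # hypothetical env var
    if webhook:
        requests.post(webhook, json={"text": f"{shepherd} is running today's push"}, timeout=10)
    return shepherd

if __name__ == "__main__":
    print(pick_and_announce())
```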

Tests

Once the workflow kicks off, it runs a bunch of tests. This includes a repeat of the unit tests that ran on each pull request, since it's possible that merging code introduces issues that didn't exist on the separate branches.

The other set of tests we run at Wrk are end-to-end integration tests. These bring up a Docker Compose environment in CI and run through a mix of black-box tests to confirm that vital high-level functionality works: for example, can a worker do a job? Can people get paid? The aim is to ensure that our core product is working so that our users can get done what they need.
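
To give a flavour of what these black-box tests look like, here is a minimal sketch; the base URL and endpoints are hypothetical, not our actual API. Each test drives the system purely from the outside, the same way a real client would.

```python
# A minimal sketch of a black-box check run against the Docker Compose
# environment in CI. The base URL and endpoints are hypothetical.
import os

import requests

BASE_URL = os.environ.get("E2E_BASE_URL", "http://localhost:8000")

def test_worker_can_pick_up_a_job():
    # Create a job through the public API, exactly as a client would.
    created = requests.post(f"{BASE_URL}/api/jobs", json={"type": "example"}, timeout=10)
    assert created.status_code == 201
    job_id = created.json()["id"]

    # A worker should then see the job in their queue.
    queue = requests.get(f"{BASE_URL}/api/workers/me/queue", timeout=10)
    assert any(job["id"] == job_id for job in queue.json())
```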

"If you make the change and then have to wait a week to get feedback on it, you're operating on a much longer feedback cycle of weeks than one of days or hours."

One of the most important practices for all of this to work is that your tests run fast. The whole process of running our unit tests, which includes setting up the Docker instance and making sure that all dependencies are up to date, takes about 5 minutes.

Sometimes you need to slow down

The integration tests take a bit longer, about 20 minutes. This is the slowest part of our entire push process and a great candidate for speeding up. The tests currently run one after the other because they affect the database, and it would be hard to ensure that each test is independent if they ran simultaneously. One improvement would be to use "splits": spin up several Docker Compose environments and allocate the tests among them. This requires some analysis of the average runtime per test so the tests can be allocated across the splits to finish in roughly the same time.
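
A rough sketch of how that allocation could work, assuming we had historical runtimes for each test (the names and numbers below are made up):

```python
# Allocate tests across N parallel "splits" so each split finishes in
# roughly the same time, using a greedy longest-processing-time heuristic.
import heapq

def allocate(test_runtimes: dict[str, float], num_splits: int) -> list[list[str]]:
    """Always give the next-longest test to the split with the least total time."""
    splits = [[] for _ in range(num_splits)]
    heap = [(0.0, i) for i in range(num_splits)]  # (total seconds, split index)
    heapq.heapify(heap)
    for test, seconds in sorted(test_runtimes.items(), key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        splits[idx].append(test)
        heapq.heappush(heap, (total + seconds, idx))
    return splits

if __name__ == "__main__":
    runtimes = {"test_jobs": 240.0, "test_payments": 180.0, "test_auth": 90.0, "test_search": 60.0}
    for i, tests in enumerate(allocate(runtimes, 2)):
        print(f"split {i}: {tests}")
```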

The last set of tests are ones we're not running at Wrk right now but that I'd like to introduce for stability: DevOps tests. I firmly believe that DevOps should be treated as another software engineering discipline. The same best practices we apply to our code should also be applied to our production environment: tests, loose coupling, and readability.

A good example of a DevOps test is one that ensures the configuration is sane. It would verify that the different deployment configs are consistent with one another and with the staging/production environments. For example: when a deployment's config references a particular Kubernetes secret, does that secret exist? If it doesn't, fail the test and don't continue with the push.
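
Here's a sketch of what such a test could look like using the official Kubernetes Python client, assuming kubeconfig access to the target cluster; the namespace is a placeholder.

```python
# Check that every secret referenced by a Deployment actually exists.
from kubernetes import client, config

def missing_secret_refs(namespace: str = "staging") -> list[str]:
    """Return secret references in Deployments that don't exist in the namespace."""
    config.load_kube_config()
    core, apps = client.CoreV1Api(), client.AppsV1Api()
    existing = {s.metadata.name for s in core.list_namespaced_secret(namespace).items}

    missing = []
    for dep in apps.list_namespaced_deployment(namespace).items:
        for container in dep.spec.template.spec.containers:
            for env in container.env or []:
                ref = env.value_from.secret_key_ref if env.value_from else None
                if ref and ref.name not in existing:
                    missing.append(f"{dep.metadata.name}: {ref.name}")
            for env_from in container.env_from or []:
                if env_from.secret_ref and env_from.secret_ref.name not in existing:
                    missing.append(f"{dep.metadata.name}: {env_from.secret_ref.name}")
    return missing

def test_all_referenced_secrets_exist():
    assert missing_secret_refs() == []
```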

Deployment

Once all the tests pass, it's time to deploy. We're using Google Kubernetes Engine on Google Cloud, so we can have automation handle most of it for us, but you should be able to adapt these ideas to whatever environment you're using.

Docker and Kubernetes simplify our lives a lot here. To deploy our code, we build the Docker images for each service, push them to our registry on GCP, and then roll them out to Kubernetes. You can easily run these images locally or in an integration test environment, and reproducing the staging or production environment is pretty straightforward using tools like Minikube.

"If your process depends on humans always doing the right thing then you're not going to have stability, and you're going to struggle to get to a rapid release system (among a lot of other things)."

For Kubernetes, instead of using vanilla YAML files, we use kubecfg and jsonnet for everything. I think this should be the default way of working with Kubernetes, though I can imagine some people don't like jsonnet and would complain.

Template everything!

Jsonnet allows us to template everything in Kubernetes so that 95% of the configs are shared between staging and production. We also re-use these templates when setting up stress-testing environments, but that's a story for another post. By sharing configs between environments like this, you ensure there aren't significant differences between staging and prod, which is one way bugs get introduced accidentally. It's also a lot less work.
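
To illustrate the idea (sketched in Python rather than jsonnet, purely for the sake of this post; the values are made up): one shared base, with tiny per-environment overrides.

```python
# The sharing idea in miniature: a common base config plus small overrides.
# jsonnet does this natively and far more powerfully; the values are made up.
BASE = {
    "replicas": 2,
    "image": "gcr.io/example-project/api",  # hypothetical registry path
    "resources": {"cpu": "500m", "memory": "512Mi"},
}

OVERRIDES = {
    "staging": {"replicas": 1},
    "production": {"replicas": 4, "resources": {"cpu": "1", "memory": "1Gi"}},
}

def render(env: str) -> dict:
    # A shallow merge is enough for this sketch.
    return {**BASE, **OVERRIDES[env]}

print(render("staging"))
print(render("production"))
```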

We have a little script that manages pushing the code; it wraps a few kubecfg calls and a script that handles any Django migrations we might have. By following solid migration practices (another story for another post), we're not really concerned about running migrations automatically.
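
For the curious, here is a stripped-down sketch of what a wrapper like that might look like; the file layout and config paths are placeholders rather than our actual script.

```python
# A rough sketch of a deploy wrapper: run migrations, then apply the
# jsonnet-templated Kubernetes config for the given environment.
import subprocess
import sys

def run(*cmd: str) -> None:
    """Echo and run a command, aborting the push on the first failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def deploy(env: str) -> None:
    # Apply any pending Django migrations first; we keep them
    # backwards-compatible so old pods keep working during the rollout.
    run("python", "manage.py", "migrate", "--noinput")
    # Apply the Kubernetes config for this environment (placeholder path).
    run("kubecfg", "update", f"config/{env}.jsonnet")

if __name__ == "__main__":
    deploy(sys.argv[1])  # e.g. "staging" or "production"
```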

Once staging is live, we start the only manual part of our push. Right now, all we do is go down a checklist of things to test in staging: the few things that changed since the previous push, plus the things we can't catch via integration testing. These are mostly integrations with third parties like Auth0 and GCS, to ensure those setups work properly.

The final push from staging to prod is dead simple: retag the Docker images and run the same deploy script against prod instead of staging. That entire step takes about a minute.

Conclusion

As you may have gathered, the secret to our speed at Wrk is to automate everything. Humans screw up, so either take them out of the picture or give them things so simple that they can't screw them up—like clicking a single button in CircleCI. By automating the majority of our testing and deployment, we have a repeatable process that works every time. If something breaks, it's easy to tweak the process to avoid that failure in the future. This investment will pay off as Wrk grows, since we keep battle-hardening the process over time: instead of instability increasing with scale, it will decrease.

One major change we intend to make is to our team development practices. Right now Wrk consists of a single engineering team, and we push all our code at once, but that approach does not scale once more than one team is doing pushes. Once you split things up, you have to start worrying about team coordination. This is not yet a problem we've had to solve; however, we have already started putting some infrastructure in place to support it. Once we split into multiple teams, I'll write a follow-up to this post about decoupling teams and parallel pushes.

Hopefully this helps you get moving at a faster velocity. Feel free to ask any questions not covered here, or suggest any improvements you might have.

Similarly, if you want to learn how other aspects of the Wrk platform operate, book a discovery call today.