How to prevent Continuous Deployment from turning into a Continuous Disaster

These days, one of the most frequent requests we hear from clients is to set up continuous deployment. Every company wants it, and every DevOps-related conference has sessions about it. However, newcomers tend to miss one of the key points of CI/CD – it really isn’t an automation problem, it’s a cultural problem. From a purely technical perspective, implementing a build-deploy pipeline is fairly simple with current tools, and this has been the case for several years. It’s true that new tools make it even easier – a brand new startup can set up CD in under 30 minutes – but automation has not been the showstopper for at least five years.

Setting up the automation for deploy-by-commit is easy, but on its own all it does is allow programmers to break production more frequently. Combined with classic dev/ops silos, the effect is explosive and the results are catastrophic.

To make matters worse, some vendors “ride the wave”, disguising their sales pitch as CI/CD sessions that give the impression CD is a problem of tools (which their product happens to solve). No product solves the problem of CD, because it is a problem of methodologies and culture. This has been discussed at length in various forums (blogs, DevOps conferences, meetups, talks, lectures, etc.) but is still not as commonly understood as it should be.

It’s worth taking a moment to explore what kinds of bugs we see in production – and what can be changed in an organization’s culture and methodologies to prevent them.

As a gross simplification, we can divide bugs into four categories:

  1. dumb bugs - code that is obviously and clearly wrong.
  2. implementation bugs - you have a design that should solve some problem, but the implementation deviates from it, often in a not-so-obvious way. Most concurrency bugs can be classified as implementation bugs.
  3. integration bugs - the classic “but it worked on my system!” bugs. These bugs consistently occur when your code base works with some specific external component (e.g. a database or an external API) but vanish when the code is tested with another component or in isolation.
  4. engineering bugs - every system is engineered to operate within some range of input signals, some range of load, and with certain dependencies provided. Another way to describe this is that a system makes assumptions about the nature of the world – assumptions that are occasionally discovered to be wrong. Unfortunately, many of these assumptions are implicit or unknown and are sometimes impossible to test in a lab. These bugs also tend to manifest as cascading failures and are notoriously hard to debug.

Tests

When building a CD pipeline, substantial technical effort should be invested in automated tests. Unit tests do a great job of rooting out dumb bugs and go a long way toward reducing the number of implementation bugs. Integration tests (run in a proper staging or integration environment) help screen out integration bugs. Testing under load and across a broad range of inputs further increases the chances of catching implementation and integration bugs, and A/B testing in production (using a beta environment, feature flags and partial deploys) is even better.
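
As an illustration, partial deploys are often driven by nothing more than a deterministic bucketing function behind a feature flag. The Python sketch below shows the general idea only; the flag name, rollout table and user IDs are hypothetical, and real systems usually back the flag registry with a configuration service:

    import hashlib

    # Hypothetical in-memory flag registry: flag name -> % of users exposed.
    ROLLOUT_PERCENTAGES = {"new_checkout_flow": 10}

    def is_enabled(flag: str, user_id: str) -> bool:
        """Deterministically bucket a user into 0-99 so the same user always
        sees the same variant while the rollout percentage is ramped."""
        percent = ROLLOUT_PERCENTAGES.get(flag, 0)
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

    # Usage: ship both code paths, expose the new one to 10% of users,
    # watch the metrics, then ramp up (or roll back) by changing a number.
    if is_enabled("new_checkout_flow", user_id="user-42"):
        pass  # new code path
    else:
        pass  # existing code path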

Tests also serve as an incremental spec that ensures you learn from the mistakes of the past. This means you will add more tests over time, increasing the cost and complexity of this part of the pipeline (SOA can help to an extent).

It’s important to observe two things related to testing:

  1. Despite tests, you will never catch all bugs; some bugs will inevitably reach production.
  2. Testing does not prevent bugs; it exposes them after they have already been written.

Culture and methodologies

This is the hardest but most important part of CD. It also has the greatest effect on quality and efficiency, and is a good systemic way to reduce the number of integration and implementation bugs that get written in the first place. Culture can help with engineering bugs as well, both in prevention and in detection.

Without going into much detail – there are countless excellent sources on the web – the cultural shift revolves around (among other things):

  • Small and frequent incremental changes
  • Developers touching and understanding production
  • Sharing Ops responsibility with Devs
  • Eliminating blame
  • Ongoing close cooperation between teams (in particular, Ops and Devs) from the earliest stages of development
  • Automation across the board
  • Measurements and visibility as a primary focus

An often overlooked fact is that culture is coupled to architecture; this is the reason CD thrives with SOA but withers with monolithic architectures.

Such a cultural shift usually has deeper effects than facilitating CD; it tends to help prevent bugs (of all types) and to improve the reliability and visibility of production. It also seems to encourage a preference for small polyglot teams, but that’s a different discussion altogether. Culture and methodologies are definitely worth investing in – regardless of CD. But if you are building CD without adopting them, you are surely setting the stage for Continuous Disaster.

Architecture

As mentioned above, architecture needs to change too. SOA plays nicely with CD and helps break an application into small modules that can change without breaking the entire system. It’s also very important to properly separate and abstract data so that code changes don’t also require schema changes.

I often hear people talk about deployment tools that handle data mutation and schema changes. Personally, I believe this should be solved architecturally, resorting to tooling only when all else fails. Schema changes are a common deployment blocker in CI/CD, and if you don’t address them from an architectural perspective, risk and effort will (almost always) grow.
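
One common architectural answer (offered here as an illustration, not as a prescription from the original text) is the expand/contract pattern: widen the schema first, migrate the code next, and remove the old structure in a later deploy, so code and schema never have to change in lockstep. A rough Python/SQL sketch, using a hypothetical users table and an invented column rename:

    # Expand/contract sketch for renaming users.fullname -> users.display_name.
    # Each phase ships as its own deploy, so a schema change never blocks a code deploy.

    PHASE_1_EXPAND = """
    ALTER TABLE users ADD COLUMN display_name TEXT;   -- old column stays in place
    """

    # Phase 2 (code-only deploy): write to both columns, read display_name and
    # fall back to fullname while the backfill below runs in the background.

    PHASE_2_BACKFILL = """
    UPDATE users SET display_name = fullname WHERE display_name IS NULL;
    """

    # Phase 3 (after every reader uses display_name): a later deploy drops the old column.
    PHASE_3_CONTRACT = """
    ALTER TABLE users DROP COLUMN fullname;
    """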

Tools

A common mistake is to use different tools for dev, QA, DBs and production. This deters developers and other teams from touching production, because they aren’t proficient with its toolset. It also makes debugging much harder, as some bugs are caused by the tools themselves (and some bugs are simply artifacts of monitoring tools). In short, the fewer tools you have, the smaller the complexity of the system (and by system I mean everything involved in the dev-test-deploy cycle).

As much as possible, try to use the same tools throughout your organization – if that means deploying with Maven or managing dev machines with Puppet, then so be it. There will be a large initial overhead, but it will diminish over time as people become familiar with the tools and adapt them to their needs.

Production visibility and resiliency

Bugs will end up in production, and they will become more subtle as tests and quality improve. This means that your production should be debuggable – which usually means visible. This is something that should be factored in from day one: developers need to structure their code in a way that allows easy extraction of meaningful data, both during normal operation and during a crisis. Simply plugging in counters and switching to the “debug” log level is not enough.
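
One way to make code “extractable” is to emit structured, machine-readable events with enough context to reconstruct what happened, rather than free-form debug lines. The Python sketch below is purely illustrative; the event names and fields are invented:

    import json
    import logging
    import time

    log = logging.getLogger("payments")

    def record_event(event: str, **fields) -> None:
        # Structured events can be queried and aggregated after the fact,
        # unlike free-form log lines.
        log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

    def charge(order_id: str, amount_cents: int) -> None:
        start = time.monotonic()
        try:
            ...  # call the payment provider (omitted)
            record_event("charge.ok", order_id=order_id,
                         amount_cents=amount_cents,
                         latency_ms=round((time.monotonic() - start) * 1000, 2))
        except Exception as exc:
            record_event("charge.failed", order_id=order_id,
                         error=type(exc).__name__)
            raise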

The same goes for resiliency; it is a fundamental design requirement and must be considered very early. Note that resiliency isn’t about simple hardware failures anymore; HA virtualization solutions and (even more so) the shift to clustered architectures have largely alleviated those problems. If a single dead server affects you in 2013, you have more pressing issues than CD. Nowadays, most major failures stem from implementation bugs, engineering bugs and human error. Resiliency is now about fault containment and fault tolerance.
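
Fault containment usually boils down to well-known patterns such as timeouts, bulkheads and circuit breakers. As a rough illustration of the idea (not a prescription from the original text), a toy circuit breaker in Python might look like this:

    import time

    class CircuitBreaker:
        """Toy circuit breaker: after max_failures consecutive errors, reject
        calls for reset_after seconds instead of piling load onto a failing
        dependency."""

        def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open - failing fast")
                # Half-open: let one trial call through.
                self.opened_at = None
                self.failures = 0
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result

    # Usage: wrap calls to a flaky dependency so its failures stay contained.
    # breaker = CircuitBreaker()
    # breaker.call(external_api.get_user, "user-42")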

The good news is that culture can help engineers get this right. Software engineers in a DevOps culture usually care about and understand these needs and design systems accordingly, so even if your production isn’t visible or resilient yet, it will evolve as the culture changes. You will start seeing things like monitoring, logging, metrics and fault tolerance appear as feature requests from teams all around.

Measuring yourself

Cultural change is hard to measure, but there are some measurables worth noting:

  • Cardinality of bugs/issues
  • Work hours invested in automatable processes
  • Number of integration bugs
  • Time to detection of production issues
  • Time to comprehension of production issues - not time to resolution. This measures how much visibility your production has and how well it is understood by the personnel involved.
  • Code changes per deployment - lower is better. There is some debate about how this should be measured, but lines of code is a decent place to start (see the sketch after this list).
  • Rate of metrics per code - again, there is debate about how to measure this. Metrics / (cardinality of external calls) is a good start.
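
As a very rough sketch of the “code changes per deployment” measurement – assuming, purely hypothetically, that every deploy is tagged in git as deploy-N – the lines changed between two deploys can be pulled straight from git:

    import subprocess

    def lines_changed(old_tag: str, new_tag: str) -> int:
        """Sum insertions and deletions between two deploy tags via git --numstat."""
        out = subprocess.check_output(
            ["git", "diff", "--numstat", old_tag, new_tag], text=True
        )
        total = 0
        for line in out.splitlines():
            added, deleted, _path = line.split("\t", 2)
            if added != "-":  # binary files report "-" for both counts
                total += int(added) + int(deleted)
        return total

    # Example (tag names are hypothetical):
    # print(lines_changed("deploy-41", "deploy-42"))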

There are many other things you can measure, and this list isn’t applicable everywhere. Culture adapts to sub-cultures, and we need to revise our methods along with it.