This talk will have 3 main parts:
- An overview of how deployments worked before, and what issues we had because of this
- A description of the process we followed to transition to our current system, along with a few details of the current system
- The challenges we faced during the transition, how we overcame them, and how we would do things differently
The third section will be the main focus of the talk.
When I joined what was then EdgeCast Networks (now part of Verizon Digital Media Services), code deployment was very fragmented. There were many teams writing software, and each had developed their own release process. While there were some shared deployment tools, there was never any standardized process for using them. In spite of these shortcomings, everything mostly worked (most of the time), and the company experienced a lot of success; the EdgeCast CDN service was reliable, performant, and growing.
However, there were a number of (predictable) problems caused by the lack of standardized deployment processes:
- There was no easy way to see which versions of code were deployed where.
- There were large discrepancies in code and configuration across our network (servers provisioned at different times would have slightly different software versions and configurations).
- The deployment process had to be repeated many times over a period of weeks to make sure that servers that were out of production during a deployment still received the new code.
- It was difficult to control when code was released. While slow code rollout was the norm, it was very easy for code to be released unexpectedly.
- There was insufficient coordination between teams. People deploying code would step on each other, and our network operations center (NOC) was not always fully aware of what was being deployed and where.
It was generally agreed that this needed to change and that we needed to improve our release process.
Our first step was to separate our development tools from our release tools. Previously, development and deployment were both done in SVN. We moved development into Git while leaving deployment in SVN. By splitting development and deployment into different systems, we were able to encapsulate the deployment process, which allowed us to iterate on deployment without affecting development.
We created a ‘bundling’ process, where Git tags became versions. Each server type specifies which version belongs on its servers. Now, it is very easy to see which version of each piece of software is on each server.
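As a rough illustration of the bundling idea, here is a minimal Python sketch in which a bundle pins each component to a tag, and each server type points at one bundle. The data layout, the bundle names, the hostnames, the tags, and the helper functions `versions_on` and `servers_running` are all hypothetical; the real bundle format is not described here.

```python
# Hypothetical data: none of these names or tags come from the real system.

# A bundle pins each component to the tag that serves as its version.
BUNDLES = {
    "cache-bundle-12": {"http-cache": "v2.4.1", "log-shipper": "v1.0.3"},
    "dns-bundle-7": {"dns-resolver": "v3.1.0", "log-shipper": "v1.0.3"},
}

# Each server type specifies which bundle belongs on its servers.
SERVER_TYPE_BUNDLE = {"cache": "cache-bundle-12", "dns": "dns-bundle-7"}

# Inventory: hostname -> server type.
SERVERS = {
    "lax-cache-01": "cache",
    "lax-cache-02": "cache",
    "lax-dns-01": "dns",
}


def versions_on(hostname):
    """Return {component: tag} expected on the given server."""
    bundle = SERVER_TYPE_BUNDLE[SERVERS[hostname]]
    return dict(BUNDLES[bundle])


def servers_running(component, tag):
    """Return the sorted hostnames whose bundle pins `component` to `tag`."""
    return sorted(h for h in SERVERS if versions_on(h).get(component) == tag)
```

With a layout like this, "which versions are on this server?" and "where is this version deployed?" both become simple lookups, for example `versions_on("lax-dns-01")` or `servers_running("http-cache", "v2.4.1")`.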
We created a service, called CoalMine, that manages the process of deploying new ‘bundles’ to servers. It runs our canaries: it chooses a small set of servers to receive the new software, then slowly expands the canary until the new version is global. It also handles notifications, so that our NOC and other developers know how deployments are going, and it provides access to A/B comparisons between canary and non-canary servers.
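The canary expansion described above can be sketched roughly as follows. This is not CoalMine's implementation: the stage percentages, the random server selection, the choice to halt on a failed health check, and the `deploy`, `healthy`, and `notify` callbacks are all assumptions made for illustration.

```python
import random


def canary_stages(servers, percents=(1, 10, 50, 100), seed=0):
    """Return cumulative server lists, one per stage, each a superset of the last.

    `percents` are assumed stage sizes; the final 100 makes the rollout global.
    """
    order = list(servers)
    random.Random(seed).shuffle(order)  # pick canaries pseudo-randomly
    stages = []
    for pct in percents:
        # Integer ceiling division, with at least one server per stage.
        count = max(1, (len(order) * pct + 99) // 100)
        stages.append(order[:count])
    return stages


def roll_out(servers, deploy, healthy, notify, percents=(1, 10, 50, 100)):
    """Deploy stage by stage; stop (an assumed policy) if any server is unhealthy."""
    done = set()
    for stage in canary_stages(servers, percents):
        for server in stage:
            if server not in done:
                deploy(server)
                done.add(server)
        notify(f"deployed to {len(done)}/{len(servers)} servers")
        if not all(healthy(server) for server in done):
            notify("canary unhealthy; halting rollout")
            return False
    return True
```

A caller would supply its own callbacks, for example `roll_out(hosts, deploy=push_bundle, healthy=check_metrics, notify=post_to_noc)`, where those three function names are placeholders.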
For a while, we maintained both systems. We slowly converted groups of developers and server types over to the new system while allowing the others to continue using the old system.
We ran into a number of obstacles along the way.
- Developers don’t like change, and like forced change even less. We had much better success when we worked closely with teams and let them transition slowly, at their own pace. A great method we found was choosing a small project that a team was working on, and converting that first. When they saw it in action, they were much more eager to transition the rest of their projects.
- The transition took longer than we thought. Large organizations have a ‘long-tail’ of small projects, and converting them all took a lot of time. Being prepared to be in the transition state for longer than expected helped a lot.
- There were a lot more workflows than we initially realized. We thought we had created a flexible enough system, but we found that we needed to make a lot of changes as we transitioned more teams. It would have been helpful to spend more time understanding the workflows of all the teams before designing the new system.
- Our most successful team transitions were the ones where we had already spent time working with them on their old deployment method. Since we had sat through multiple deployments in the old style, we knew their requirements and they felt confident that we understood their needs.
- Developer trust is key. We didn’t start this project until our DevOps team had been at the company for over a year. If you come in as an outsider and try to implement a new system right away, developers will NOT go along with it. The company had previously attempted to transition to a better deployment system, but those attempts were made by people who had not been at the company long enough to gain developer trust.
- Communication is key. Many of the issues that arose were caused by one side (either the developer or DevOps) not knowing something was happening. No one likes to be surprised by a change to their system.
- Empathy for your developers goes a long way. When something doesn’t work, and developers are upset, listening to them and empathizing will often cool their anger.
- Admitting when you mess up is huge. While this is true for all professionals in all fields, it is especially true for engineers. If you tell someone that you know you messed up and are working to fix it, you will cut short the blaming and berating that might have otherwise occurred.