Build resilient systems at scale
October 12–14, 2015 • New York, NY

Chaos monkey on your laptop: How to simulate harsh infrastructure conditions in your local tests

Matthew Campbell (DigitalOcean)
12:00pm–12:30pm Monday, 10/12/2015
Location: Rhinelander South

Prerequisite Knowledge

This session is aimed at a technical audience: developers, sysadmins, or technical managers. Attendees should have existing projects into which they want to incorporate stronger testing strategies.

Description

At Thomson Reuters, we are responsible for one of the largest financial instant messenger servers on the planet. Our users are very demanding: they want highly reliable software that is updated often. Initially, we struggled with infrastructure failures as simple as a database server or a switch going down. Over time, as we started to move to multiple datacenters, netsplits became a bigger concern. How do we let developers test these conditions easily, without taking up an entire day of the Ops team's time? How do we write code and validate that it keeps working through a netsplit or a machine failure?

What is Chaos Monkey?
It’s a practice, and a piece of software, devised by Netflix, in which you randomly kill servers in a production environment. We wanted an easier starting point: integration tests that run on our laptops and simulate network splits.
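As a rough illustration of the "randomly kill servers" idea scaled down to a local test, the sketch below only picks which nodes a test run should take down. It is not the talk's code or Netflix's tool. The function name `pick_victims`, the `kill_fraction` parameter, and the host names are all hypothetical. A fixed seed keeps a failing run reproducible.

```python
import random


def pick_victims(hosts, kill_fraction=0.34, seed=None):
    """Choose a random subset of hosts to take down during a test run.

    Hypothetical helper: it only selects names; stopping the actual
    VM or container is left to the test harness.
    """
    hosts = list(hosts)
    if not hosts:
        return []
    # A dedicated Random instance makes the choice repeatable per seed.
    rng = random.Random(seed)
    # Kill at least one host, and never more than exist.
    count = min(len(hosts), max(1, int(len(hosts) * kill_fraction)))
    return sorted(rng.sample(hosts, count))
```

A test could call `pick_victims(["im-node-1", "im-node-2", "im-node-3"], seed=42)`, stop the returned node, and then assert that the remaining nodes still deliver messages.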

Virtual machines and Docker
Lots of organizations use virtual machines, but not usually on the developer’s workstation. We started spinning up multiple virtual machines and Docker containers within our local test runs. This allowed us to script events such as disabling network cards or hosts joining and leaving our instant messenger cluster. We were able to move this practice onto Jenkins and Amazon, even though our production systems run on physical hardware.
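One way to script a node dropping off the network from inside a test is sketched below. It assumes the nodes run as Docker containers attached to a user-defined network and uses the `docker network disconnect` and `docker network connect` CLI commands. This is an illustrative substitute, not necessarily the mechanism used in the talk. The network name `chatnet` and container name `im-node-2` are hypothetical.

```python
import subprocess
from contextlib import contextmanager


def disconnect_cmd(network, container):
    """Command line that detaches a container from a Docker network."""
    return ["docker", "network", "disconnect", network, container]


def connect_cmd(network, container):
    """Command line that re-attaches a container to a Docker network."""
    return ["docker", "network", "connect", network, container]


@contextmanager
def netsplit(network, container, run=subprocess.check_call):
    """Isolate one container for the duration of the with-block.

    `run` is injectable so the helper can be exercised without Docker.
    """
    run(disconnect_cmd(network, container))
    try:
        yield
    finally:
        # Always heal the split, even if the test body raises.
        run(connect_cmd(network, container))
```

In a test, the body of `with netsplit("chatnet", "im-node-2"):` would send messages through the remaining nodes and assert on how the cluster behaves while the node is unreachable.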

Moving to multiple live datacenters
Historically, we have used multiple datacenters only for hot/cold failovers. We wanted to do a lot of simulation in our test suite to show what would happen to replication for MySQL, Redis, and other components once we introduced latency into the picture. I will show you some techniques we used to introduce latency between VMs on the local machine.
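The abstract does not say which latency techniques the talk covers. One common approach on Linux is the `netem` queueing discipline, configured through `tc`. The sketch below builds the commands to add and remove a delay on a container's interface. It assumes the container image ships `tc` (iproute2) and was started with the NET_ADMIN capability. The helper names, the container name, and the default device `eth0` are hypothetical.

```python
def add_latency_cmd(container, delay_ms, jitter_ms=0, dev="eth0"):
    """Build a command that delays outgoing packets inside a container."""
    cmd = [
        "docker", "exec", container,
        "tc", "qdisc", "add", "dev", dev, "root",
        "netem", "delay", f"{delay_ms}ms",
    ]
    if jitter_ms:
        # Optional second value: random variation around the base delay.
        cmd.append(f"{jitter_ms}ms")
    return cmd


def clear_latency_cmd(container, dev="eth0"):
    """Build a command that removes the root qdisc, restoring defaults."""
    return ["docker", "exec", container, "tc", "qdisc", "del", "dev", dev, "root"]
```

A test might pass `add_latency_cmd("mysql-replica", 100, jitter_ms=10)` to `subprocess.check_call`, measure replication lag, and then run the matching `clear_latency_cmd` in teardown.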

DevOps and continuous integration
I will give a brief overview of how we integrated these test suites with Jenkins to spin up machines on Amazon. I will go into some detail about why cloud providers are great for testing even if you deploy on bare metal, as we do in our private datacenters. Our Ops team appreciated being more confident in each release and was able to cut our release cycle by 50%.


Matthew Campbell

DigitalOcean

Matthew Campbell is a microservices scalability expert at DigitalOcean, where he builds the future of cloud services. Matthew is a founder of Errplane and Langfight. In the past, he worked at Thomson Reuters, Bloomberg, Gucci, and Cartoon Network. Matthew recently presented at GothamGO, Velocity NYC, and GopherCon India and is the author of Microservices in Go, published by O’Reilly. He blogs at Kanwisher.com.

