At Thomson Reuters, we are responsible for one of the largest financial instant messenger servers on the planet. We have very demanding users, who want highly reliable software that gets updated often. We initially ran into a lot of difficulty with infrastructure failures as simple as database servers or switches going down. Over time, as we started to move to multiple datacenters, netsplits became a bigger concern. How do we let developers test these conditions easily, without taking up an entire day with the Ops team? How do we write code and validate it so that we can be sure it works through a netsplit or a machine failure?
What is Chaos Monkey?
It’s a practice and software devised by Netflix, by which you randomly kill servers in a production environment. We wanted an easier starting point, and wanted to be able to write integration tests on our laptops that simulated network splits.
Virtual machines and Docker
Lots of organizations use virtual machines, but not usually on the developer’s workstation. We started spinning up multiple virtual machines and Docker containers within our local test runs. This allowed us to script things like network cards being disabled, or hosts joining and leaving our instant messenger cluster. We were able to move this practice onto Jenkins and Amazon, despite the fact that we use physical hardware for our production systems.
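A partition like this can be scripted against a local Docker test cluster. This is a minimal sketch, not the talk’s actual tooling: the container name (im-node-2), network name (im-cluster), and test script are hypothetical.

```shell
# Hypothetical sketch: simulate a node dropping out of a local Docker test
# cluster by detaching it from the shared bridge network, then rejoining it.
# Names (im-cluster, im-node-2, run-partition-tests.sh) are illustrative.

# Cut im-node-2 off from the rest of the cluster; its peers should now
# treat it as a failed host.
docker network disconnect im-cluster im-node-2

# Run the integration tests that assert the cluster stays available
# while one member is unreachable.
./run-partition-tests.sh

# Heal the "failure" and let the node rejoin the cluster.
docker network connect im-cluster im-node-2
```

Because this runs against ordinary containers, the same script works on a laptop and, later, on a CI box.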
Moving to multiple live datacenters
Historically we have only used multiple datacenters to do hot/cold failovers. We wanted to do a lot of simulation in our test suite, to show what would happen to replication for MySQL/Redis and other components when we introduced latency to the picture. I will show you some techniques we used to introduce latency between VMs on the local machine.
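One such technique is Linux traffic control with netem, which can impose artificial delay on a VM’s network interface to mimic cross-datacenter latency. A minimal sketch follows; it assumes root on the VM and that the interface is named eth0 (both are assumptions, not details from the talk).

```shell
# Hypothetical sketch: use tc/netem to mimic cross-datacenter latency
# between local VMs. Requires root; eth0 is an assumed interface name.

# Delay every outgoing packet on eth0 by 200ms, +/- 50ms of jitter.
tc qdisc add dev eth0 root netem delay 200ms 50ms

# ...observe MySQL/Redis replication behavior under the added latency...

# Remove the artificial delay when the test is done.
tc qdisc del dev eth0 root netem
```

Running replication tests with and without the delay in place makes it easy to see which components tolerate latency and which fall behind.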
DevOps and continuous integration
I will give a brief overview of how we ended up integrating these test suites with Jenkins to spin up machines on Amazon, and go into some detail about why using cloud providers for testing is great even if you deploy on bare metal, as we do in our private datacenters. Our Ops team appreciated being more confident with each release, and we were able to cut our release cycle by 50%.
Matthew Campbell is a microservices scalability expert at DigitalOcean, where he builds the future of cloud services. Matthew is a founder of Errplane and Langfight. In the past, he worked at Thomson Reuters, Bloomberg, Gucci, and Cartoon Network. Matthew recently presented at GothamGO, Velocity NYC, and GopherCon India and is the author of Microservices in Go, published by O’Reilly. He blogs at Kanwisher.com.
©2015, O'Reilly Media, Inc.