Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

Chaos engineering bootcamp

Tammy Butow (Dropbox)
9:00am–12:30pm Tuesday, June 20, 2017
Systems Engineering
Location: LL20 A/B
Level: Beginner
Average rating: 4.40 (5 ratings)

Who is this presentation for?

  • Site reliability engineers, software engineers, systems engineers, and engineering leaders who want to create antifragile services

Prerequisite knowledge

  • A basic understanding of production environments and the infrastructure required to run systems
  • Experience with Linux, cloud infrastructure, hardware, networking, and systems troubleshooting
  • Suggested reading: "Introducing Chaos Engineering" and Production-Ready Microservices by Susan Fowler—especially the section on chaos testing at Uber (p. 94)

Materials or downloads needed in advance

  • A laptop with VirtualBox installed
  • Required: Download the three files provided here before the conference so you do not overload the conference WiFi connection. (You will need VirtualBox to run the Vagrant image.)

What you'll learn

  • Learn how to create an ecosystem to use for chaos engineering
  • Understand common chaos engineering tools such as Chaos Monkey

Description

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

  1. Start by defining “steady state” as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
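
The four steps above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not material from the workshop: the server pool, the round-robin client with failover, the tolerance value, and the names `success_rate` and `run_experiment` are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control: float
    experimental: float
    hypothesis_disproved: bool


def success_rate(servers_up, num_requests, max_attempts):
    """Step 1: the steady-state metric, here the fraction of requests served."""
    n = len(servers_up)
    served = 0
    for request in range(num_requests):
        # Hypothetical client: start at a round-robin server, then fail over
        # to the next server in the pool on each further attempt.
        if any(servers_up[(request + k) % n] for k in range(max_attempts)):
            served += 1
    return served / num_requests


def run_experiment(num_servers, crashed, max_attempts,
                   tolerance=0.01, num_requests=1000):
    # Step 2: hypothesize that both groups stay within `tolerance` of each other.
    control_group = [True] * num_servers
    # Step 3: introduce a real-world event (crashed servers) in the
    # experimental group only.
    experimental_group = [i not in crashed for i in range(num_servers)]

    control = success_rate(control_group, num_requests, max_attempts)
    experimental = success_rate(experimental_group, num_requests, max_attempts)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    disproved = abs(control - experimental) > tolerance
    return ExperimentResult(control, experimental, disproved)


if __name__ == "__main__":
    # Without failover, one crashed server out of four breaks steady state.
    print(run_experiment(num_servers=4, crashed={0}, max_attempts=1))
    # With one failover attempt, the same fault is absorbed.
    print(run_experiment(num_servers=4, crashed={0}, max_attempts=2))
```

In this sketch a disproved hypothesis is the useful outcome: it points at a weakness (here, a client with no failover) that can be fixed before a real outage exposes it.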

Tammy Butow leads a hands-on tutorial on chaos engineering, covering the tools and practices you need to implement it in your organization. Even if you already practice chaos engineering, you’ll identify new ways to apply it within your engineering organization and discover how other companies use it, along with the positive results they have had using chaos to create reliable distributed systems.

Outline

Laying the foundations

  • What is chaos engineering?
  • The principles of chaos engineering
  • Why are many engineering organizations (including Netflix, Dropbox, Uber, National Australia Bank, and Yandex) using chaos engineering, and how can every engineering organization use chaos engineering to create reliable systems?
  • How to get started using chaos engineering with your own team and how to measure success

Chaos tools

  • Common open source chaos tools
  • How to use chaos engineering for cloud and physical infrastructure servers
  • How to get started using Chaos Monkey

Advanced topics

  • How to get started using chaos engineering for databases (MySQL)
  • How to get started using chaos engineering for Go
  • What is intuition engineering, and how can tools like Vizceral help you create reliable distributed systems?
  • Where can you learn more?
  • How to join the chaos community

Tammy Butow

Dropbox

Tammy Butow is a site reliability engineering manager at Dropbox, where she is the team lead for the Databases and Magic Pocket SRE teams. She enjoys working on infrastructure engineering and is interested in chaos engineering, antifragile systems, automation, Go, and Linux. Previously, Tammy worked in security engineering and product engineering. She is the cofounder of Girl Geek Academy, a global movement to teach 1 million women technical skills by 2025. Girl Geek Academy received support from the Australian prime minister and a grant from the Australian government in 2016 to scale the Miss Makes Code program, which is aimed at teaching algorithms to 5- to 8-year-old girls. An Australian, Tammy currently lives in San Francisco, where she likes to ride bikes, skateboard, snowboard, and surf. She also loves mosh pits, crowd surfing, metal, and hardcore punk.

Comments

Tammy Butow | SITE RELIABILITY ENGINEERING MANAGER
06/20/2017 7:23am PDT

Thanks for attending!

Here are the slides from the workshop: https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp

Tammy Butow | SITE RELIABILITY ENGINEERING MANAGER
06/19/2017 1:36pm PDT

Hi everyone, the link was moved. You can access “Introducing Chaos Engineering” here: https://web.archive.org/web/20170206125837/http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html. There is a more recent post here too: https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa

Jennifer Beck | SOFTWARE ENGINEER
06/19/2017 9:37am PDT

Hi Tammy, it seems that the suggested reading, “Introducing Chaos Engineering,” is no longer on the Netflix tech blog.

Evandro Silvestre | SOFTWARE ENGINEERING MANAGER
06/12/2017 11:06pm PDT

Hi, I think Netflix removed the “Introducing Chaos Engineering” post from the blog :(