Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

Before, during, and after chaos: Creating foresight through a cyclic approach

Nora Jones (Slack)
11:35am–12:15pm Wednesday, June 12, 2019
Building Resilient Systems
Location: LL21 C/D

Who is this presentation for?

  • Engineers (software, SRE, etc.) and leaders

Level

Intermediate

Prerequisite knowledge

  • Familiarity with chaos engineering and the benefits of the discipline
  • A basic understanding of failure-injection techniques

What you'll learn

  • Understand advanced thinking and approaches for each piece of the chaos experiments cycle—with a special focus on the time before and after the experiment
  • Learn how to improve the required skill sets and mind-sets to get the most benefit out of chaos engineering exercises

Description

Chaos engineering involves more than building platforms for testing resilience and running game days. Key parts of the practice can’t be detected by your code: understanding each individual’s concerns, ideas, and mental models of how the system is structured, and learning what your organization is good at in terms of technical and human resilience. Nora Jones shares three phases that deserve individualized attention and forethought when you consider chaos experiments, with a special focus on the time before and after the experiment.

Nora goes through each part of the chaos engineering cycle and answers the questions associated with that particular step in the cycle, drawing on her personal experience of what has (and hasn’t) worked while executing resilience experiments across multiple companies with vastly different business goals. Each phase of the cycle requires a different skill set and different types of roles to maximize success. Nora talks through these skill sets and mind-sets (both of which can be trained and aided) in order to make these pieces most effective.

The chaos engineering cycle

Before the experiment: How do you learn about deviations of understanding and assumptions of steady state among teammates by asking questions? Where do these differences derive from, and what do they mean? How do you define a “normal” or “good” operation? What’s the perceived value of experimenting on this piece of the system? How do you encourage people to draw out their mental models in a structured way—either through a visual representation or a structured hypothesis about how they think the system operates?
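One way to make the "structured hypothesis" idea concrete is to have each teammate write down what they consider normal for a metric, then look for ranges that do not overlap. The following is only a minimal sketch under stated assumptions: the class name, field names, and the overlap rule are hypothetical illustrations and are not taken from the talk.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """One teammate's written-down belief about 'normal' operation (hypothetical schema)."""
    author: str
    metric: str             # e.g. "p99_latency_ms"
    low: float              # lower bound of what the author considers normal
    high: float             # upper bound of what the author considers normal
    expected_effect: str    # what the author thinks the failure will do


def ranges_overlap(a: SteadyStateHypothesis, b: SteadyStateHypothesis) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a.low <= b.high and b.low <= a.high


def find_disagreements(hypotheses):
    """Return (author, author, metric) for pairs whose 'normal' ranges never overlap."""
    disagreements = []
    for a, b in combinations(hypotheses, 2):
        if a.metric == b.metric and not ranges_overlap(a, b):
            disagreements.append((a.author, b.author, a.metric))
    return disagreements


# Illustrative values only; not real measurements.
team = [
    SteadyStateHypothesis("alice", "p99_latency_ms", 100, 200, "retries absorb the failure"),
    SteadyStateHypothesis("bob", "p99_latency_ms", 250, 400, "latency doubles, no errors"),
    SteadyStateHypothesis("carol", "p99_latency_ms", 150, 300, "fallback cache serves traffic"),
]
disagreements = find_disagreements(team)
```

A non-empty result is a prompt for conversation rather than a verdict: it points at the pairs of people whose assumptions about steady state are worth comparing before any failure is injected.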

During the experiment: In order to get to the “during” process, there may be some things you decide to automate. How do you decide what these things should be? How do you determine what scale you can safely execute your experiments on? What should measurements of a resilience experimentation platform's effectiveness be used for, and what should they not be used for? How do you separate signal from noise and determine if an error is the result of the chaos experiment or something else?
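One common way to approach the signal-versus-noise question is to compare an experiment group against a control group that receives no injected failure. The sketch below uses a standard two-proportion z-test on error counts. It is an illustration only: the control-group setup, the function names, and the 2.58 threshold are assumptions and are not described in the talk.

```python
import math


def two_proportion_z(errors_control, total_control, errors_experiment, total_experiment):
    """z-score for the difference in error rate between experiment and control."""
    p_control = errors_control / total_control
    p_experiment = errors_experiment / total_experiment
    pooled = (errors_control + errors_experiment) / (total_control + total_experiment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_control + 1 / total_experiment))
    if std_err == 0:
        # No variation at all (for example, zero errors in both groups).
        return 0.0
    return (p_experiment - p_control) / std_err


def likely_caused_by_experiment(errors_control, total_control,
                                errors_experiment, total_experiment,
                                threshold=2.58):
    """True when the experiment group's error rate is significantly higher.

    The threshold is an assumed cutoff (about 99% two-sided confidence).
    """
    z = two_proportion_z(errors_control, total_control, errors_experiment, total_experiment)
    return z > threshold


# Illustrative counts only; not real data.
clear_signal = likely_caused_by_experiment(50, 10_000, 120, 10_000)
background_noise = likely_caused_by_experiment(50, 10_000, 52, 10_000)
```

Errors that show up equally in both groups are more likely background noise than an effect of the injected failure, which is the reason for running a control group alongside the experiment.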

After the experiment: If the experiment found an issue: What did you learn? If the experiment didn’t find an issue: What did you learn? How do you use this information to restructure understandings and repeat?

Nora Jones

Slack

Nora Jones practices chaos engineering and human factors at Slack and is a student of human factors and systems safety at Lund University. She’s passionate about resilient software, people, and the intersection of those two worlds. She cowrote the book on chaos engineering with her teammates while working at Netflix and keynoted at AWS re:Invent in 2017 to an audience of over 40,000 people about the technical benefits and business case behind implementing chaos engineering.
