Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

Move fast and learn from incidents

Ryan Kitchens (Netflix), Lorin Hochstein (Netflix), Nora Jones (Slack)
9:00am–12:30pm Tuesday, June 11, 2019
Average rating: 4.80 (10 ratings)

Who is this presentation for?

  • People operating a service

Level

Intermediate

Prerequisite knowledge

What you'll learn

  • Learn to change your mind-set from eliminating human error to learning and adapting
  • Understand how incidents unfold over time
  • Learn how to ask questions that lead to key understandings about how the incident unfolded, revealing organizational, technical, and coordination-related areas that may need attention

    Description

    Organizations successfully adapt to change by learning from incidents: developing ways to prepare, examine, discuss, and imagine them.

    Ryan Kitchens, Lorin Hochstein, and Nora Jones go beyond the traditional ways of responding to and learning from incidents to explore more effective approaches and techniques that help you build the capacity to encounter failure and manage its consequences successfully.

    You’ll then have the opportunity to role-play different incident scenarios.

    Schedule

    • 9:00am–9:30am Introduction
    • 9:30am–10:30am Exercise #1
    • 10:30am–11:00am Coffee break
    • 11:00am–12:00pm Exercise #2
    • 12:00pm–12:30pm Debrief

    Outline

    Set context about what’s going to happen in the tutorial

    • Focus is on eliciting details about an incident
    • Role-playing exercises
    • Schedule

    Examples of things we can learn

    • Gaps
    • Skill transfer
    • Shared understanding

    Questions should reveal (from the Etsy Debrief Facilitation Guide)

    • Cues that lead people to make observations
    • Context for assessments or judgments
    • Rationales for choices or decisions
    • Things that people know (and might assume are common knowledge)
    • People’s states of mind at the time
    • Mental models for how things “should” work
    • Factors that led people to take a specific action
    • Signals that bring people to ask for help

    Traps to avoid

    • Root cause and why
    • Human error
    • Counterfactuals

    Exercises

    Put your new skills to use during incident role-playing scenarios, where you will have the opportunity to play someone involved in the incident or an incident investigator. TAs and presenters will be there to guide the exercises.

    Ryan Kitchens

    Netflix

    Ryan Kitchens is a site reliability engineer on the CORE team at Netflix, where he works on building capacity across the organization to ensure its availability and reliability. Previously, Ryan was a founding member of the SRE team at Blizzard Entertainment.

    Lorin Hochstein

    Netflix

    Lorin Hochstein is a senior software engineer on the cloud operations and reliability engineering (CORE) team at Netflix, where he works on ensuring that Netflix remains available. Previously, he was a senior software engineer at SendGrid, lead architect for cloud services at Nimbis Services, a computer scientist at the University of Southern California’s Information Sciences Institute, and an assistant professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. Lorin holds a BEng in computer engineering from McGill University, an MS in electrical engineering from Boston University, and a PhD in computer science from the University of Maryland.

    Nora Jones

    Slack

    Nora Jones practices chaos engineering and human factors at Slack and is a student of human factors and systems safety at Lund University. She’s passionate about resilient software, people, and the intersection of those two worlds. She cowrote the book on chaos engineering with her teammates while working at Netflix and keynoted at AWS re:Invent in 2017 to an audience of over 40,000 people about the technical benefits and business case behind implementing chaos engineering.

    Comments

    Zachary K. Perkins | DEVOPS SYSTEMS ENGINEER
    07/17/2019 6:07am PDT

    Hey there, I’d like to have the fake scenarios we used from the conference to try running through with my own organization. Could someone get those to me?