Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

Move fast and learn from incidents

Ryan Kitchens (Netflix), Lorin Hochstein (Netflix), Nora Jones (Slack)
9:00am–12:30pm Tuesday, June 11, 2019

Who is this presentation for?

  • People operating a service

Level

Intermediate

Prerequisite knowledge

  • Experience operating a service

Materials or downloads needed in advance

  • A copy of an incident you experienced that you'd like to dig into further, redacted and made generic enough that it doesn't reveal any company information (optional)
  • Laptop

What you'll learn

  • Learn to shift your mindset from eliminating human error to learning and adapting
  • Understand how incidents unfold over time
  • Learn how to ask questions that lead to key understandings about how the incident unfolded, revealing organizational, technical, and coordination-related areas that may need attention

    Description

    Organizations successfully adapt to change by learning from incidents: developing ways to prepare for, examine, discuss, and imagine them. Ryan Kitchens, Lorin Hochstein, and Nora Jones go beyond the traditional ways of responding to and learning from incidents. They explore more effective approaches and techniques that help you build the capacity to encounter failure and manage its consequences successfully.

    Outline:

    Incident preparation and response

    • Conditions of incidents
    • Incidents are unique

    Measures of success

    • SLOs and KPIs, observability
    • Alert philosophy

    Organizing incident response

    • On-call and engagement models
    • Gauging the impact
    • Areas of responsibility

    Mitigation and resolution

    • Capacity to encounter failure
    • Repair, communications, coordination, documentation

    Postincident analysis: Memorialization

    • Collecting data
    • Generating artifacts
    • Qualitative measures

    Resilience engineering concepts

    • Each necessary, but only jointly sufficient
    • Hazards and risks
    • Human factors

    Learning and sharing

    • Debriefing and facilitating a learning review—who should be invited, who should present, and how should they present?
    • Publishing the data
    • Achieving engagement from the wider organization
    • Recognizing different audiences

    Ryan Kitchens

    Netflix

    Ryan Kitchens is a site reliability engineer on the CORE team at Netflix, where he works on building capacity across the organization to ensure its availability and reliability. Previously, Ryan was a founding member of the SRE team at Blizzard Entertainment.


    Lorin Hochstein

    Netflix

    Lorin Hochstein is a senior software engineer on the cloud operations and reliability engineering (CORE) team at Netflix, where he works on ensuring that Netflix remains available. Previously, he was a senior software engineer at SendGrid, lead architect for cloud services at Nimbis Services, a computer scientist at the University of Southern California’s Information Sciences Institute, and an assistant professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. Lorin holds a BEng in computer engineering from McGill University, an MS in electrical engineering from Boston University, and a PhD in computer science from the University of Maryland.


    Nora Jones

    Slack

    Nora Jones practices chaos engineering and human factors at Slack and is a student of human factors and systems safety at Lund University. She’s passionate about resilient software, people, and the intersection of those two worlds. She cowrote the book on chaos engineering with her teammates while working at Netflix and keynoted at AWS re:Invent in 2017 to an audience of over 40,000 people about the technical benefits and business case behind implementing chaos engineering.
