Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

How did things go right: Learning more from incidents

Ryan Kitchens (Netflix)
1:25pm-2:10pm Wednesday, June 12, 2019
Average rating: 4.58 (12 ratings)

Who is this presentation for?

  • SREs, DevOps, production engineers, and incident managers

Level

Beginner

Prerequisite knowledge

  • Familiarity with an incident

What you'll learn

  • Learn what creates a useful and readable incident investigation
  • Understand how to ask the right questions to find contributing factors rather than stopping at a root cause
  • Explore the conditions that existed at the time that allowed an incident to occur
  • Understand which conversations to begin in your organization to find out how normal work is successful
  • Discover strategies to ensure the pressure to learn outweighs the pressure to fix

Description

Learning solely from failure isn’t a fundamental; it’s a limitation. A look into the new view of safety, human and organizational performance, and resilience engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure but rather from the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the nines don’t matter requires expertise. Ryan Kitchens details more rewarding ways to approach incident investigation without focusing too heavily on failure prevention: asking what’s going on when it seems like nothing is happening; exploring what’s going to keep failure from being worse when it does occur; examining how teams adapt successfully when preventative techniques fail; and diving into how we should prioritize the effort to develop systems that help us safely manage the consequences of failure. These questions can’t be answered by explaining the causes of failure and completing remediation items. We can increase our opportunity to learn from success with some fundamental and practical ways to get from “Why did things go wrong?” to “How did things go right?”

Ryan Kitchens

Netflix

Ryan Kitchens is a site reliability engineer on the CORE team at Netflix, where he works on building capacity across the organization to ensure its availability and reliability. Previously, Ryan was a founding member of the SRE team at Blizzard Entertainment.