Build & Maintain Complex Distributed Systems
June 11–12, 2018: Training
June 12–14, 2018: Tutorials & Conference
San Jose, CA

How To Establish a High Severity Incident Management Program

Tammy Butow (Gremlin)
9:00am–12:30pm Tuesday, June 12, 2018
Location: LL21 A/B Level: Beginner
Secondary topics: Resilient, Performant & Secure Distributed Systems

Prerequisite knowledge

Interest in incident management programs and improving reliability

Description

Introduction
High severity incident management is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems. The purpose of establishing a program is to enhance the customer experience by improving your infrastructure reliability and upskilling your team. In this session, you will learn how to establish and measure the success of your own high severity incident management program.

What is High Severity Incident Management?
The management of high severity incidents encompasses high severity incident (SEV) detection, diagnosis, mitigation, prevention, and closure. SEV prevention includes SEV review and SEV correlation.
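The stages named above can be pictured as a simple progression. The following is a minimal sketch, not the speaker's implementation: the names `SevStage` and `next_stage` are hypothetical, and the strict linear ordering is an assumption made for illustration (in practice, prevention work such as SEV review may overlap with other stages).

```python
from enum import Enum


class SevStage(Enum):
    """Stages of a SEV, in the order the description lists them (assumed linear)."""
    DETECTION = 1
    DIAGNOSIS = 2
    MITIGATION = 3
    PREVENTION = 4  # includes SEV review and SEV correlation
    CLOSURE = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None once the SEV is closed."""
    stages = list(SevStage)  # Enum members iterate in definition order
    idx = stages.index(stage)
    return stages[idx + 1] if idx + 1 < len(stages) else None
```

For example, `next_stage(SevStage.DETECTION)` yields `SevStage.DIAGNOSIS`, and `next_stage(SevStage.CLOSURE)` yields `None`.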

What are SEVs?
SEV is a term used to refer to an incident; it is derived from the word "severity."

Establishing your SEV program

  • What are common types of SEVs?
  • What are examples of SEVs?
  • What are SEV levels?
  • How do your resolution times impact SLOs/SLAs?
  • What is the full lifecycle of a SEV?
  • How are SEVs measured?
  • How do you create SEV levels for free and paid products?
  • How should you name SEVs?
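To make the question about resolution times and SLOs/SLAs concrete, here is a small arithmetic sketch. It assumes an availability-style SLO measured over a 30-day window and treats every minute of a SEV as full downtime; both are simplifying assumptions, and the function names are hypothetical.

```python
def downtime_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, sev_durations_minutes, window_days=30):
    """Budget left after subtracting the duration of each SEV in the window.

    A negative result means the SLO was missed for this window.
    """
    budget = downtime_budget_minutes(slo_percent, window_days)
    return budget - sum(sev_durations_minutes)
```

Under these assumptions, a 99.9% SLO allows roughly 43.2 minutes of downtime in 30 days, so two SEVs that each take 25 minutes to resolve would exceed the budget.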

Measuring the success of your SEV program

  • How do you ensure your team operates effectively during a SEV 0?
  • How do you set up IMOCs for success during SEV 0s?
  • How do you empower everyone in your company to record SEVs?
  • What causes SEVs?
  • How do you categorize SEVs?
  • How do you prevent SEVs from repeating?
  • How can you use Chaos Engineering to empower your teams to prevent SEVs?

A high severity incident management program is an important subset of reliability engineering, focused on ensuring that a team is prepared to manage incidents. Establishing one might seem complex; however, it can greatly improve your product’s customer experience and help you meet SLAs. It also leaves you better prepared for compliance and auditing events as they arise. By following this guide, you will be able to establish your own high severity incident management program and measure its success.

Tammy Butow

Gremlin

Tammy Butow is a Principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using its control plane and API. Tammy previously led SRE teams at Dropbox responsible for the Databases and Storage systems used by over 500 million customers. She was also an IMOC (Incident Manager On-Call) at Dropbox, where she was responsible for managing and resolving high severity incidents across all of Dropbox. Previously, Tammy worked in infrastructure engineering, security engineering, and product engineering. She is the co-founder of Girl Geek Academy, a global movement to teach 1 million women technical skills by 2025. Tammy is Australian and enjoys riding bikes, skateboarding, snowboarding, and surfing. She also loves mosh pits, crowd surfing, metal, and hardcore punk.
