Build Systems that Drive Business

June 11–12, 2018: Training
June 12–14, 2018: Tutorials & Conference

San Jose, CA

How to establish a high-severity incident management program

Tammy Butow (Gremlin)

9:00am–12:30pm Tuesday, June 12, 2018

Location: 230 A

Level: Beginner

Secondary topics: Resilient, Performant & Secure Distributed Systems

Average rating: 4.33 (3 ratings)

What you'll learn

Learn how to establish a high-severity incident management program and measure its success

Description

High-severity incident management is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems. Its goal is to enhance the customer experience by improving your infrastructure reliability and upskilling your team. Managing high-severity incidents encompasses SEV detection, diagnosis, mitigation, prevention, and closure, where SEV is a term for an incident, derived from the word "severity." SEV prevention includes SEV review and SEV correlation.
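
As a rough illustration of the lifecycle stages named above, a SEV record could be modeled as a small state machine. This is a minimal sketch, not material from the workshop: the `Sev` class, its fields, the reading of level 0 as the most severe, and the strictly linear ordering of stages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages named in the description, in the order listed there."""
    DETECTION = 1
    DIAGNOSIS = 2
    MITIGATION = 3
    PREVENTION = 4
    CLOSURE = 5


@dataclass
class Sev:
    """Hypothetical SEV record; the field names are illustrative only."""
    name: str
    level: int  # assumption: lower number means more severe, e.g. 0 for a SEV 0
    stage: Stage = Stage.DETECTION
    history: list = field(default_factory=list)

    def advance(self) -> Stage:
        """Move to the next stage; assumes stages are never skipped."""
        if self.stage is Stage.CLOSURE:
            raise ValueError(f"{self.name} is already closed")
        self.history.append(self.stage)
        self.stage = Stage(self.stage + 1)
        return self.stage


sev = Sev(name="example-outage", level=0)
sev.advance()
print(sev.stage.name)  # DIAGNOSIS
```

A real program would likely allow stages to overlap or repeat; the linear model is only the simplest way to show the sequence.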

Tammy Butow walks you through establishing a high-severity incident management program and measuring its success. Such a program is an important subset of reliability engineering, focused on ensuring that a team is prepared to manage incidents. Setting one up might seem complex, but it can greatly improve your product’s customer experience and help you meet SLAs. It also leaves you better prepared for compliance and auditing events as they arise.

Outline

Establishing your SEV program

Common types of SEVs
Examples of SEVs
SEV levels
How resolution times impact SLOs/SLAs
The full lifecycle of a SEV
Measuring SEVs
Creating SEV levels for free and paid products
Naming SEVs
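
To make the link between resolution times and SLOs/SLAs concrete, the sketch below computes how much downtime an availability target allows and how much of that budget a single SEV consumes. The 99.9% target, the 30-day window, the function names, and the assumption that the SEV is a full outage for its whole duration are all illustrative choices, not figures from the workshop.

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over a window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * window_days * 24 * 60


def budget_consumed(resolution_minutes: float, target: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's downtime budget used by one full-outage SEV."""
    return resolution_minutes / downtime_budget_minutes(target, window_days)


# Hypothetical example: 99.9% availability target, SEV resolved in 30 minutes.
print(round(downtime_budget_minutes(0.999), 1))
print(round(budget_consumed(30, 0.999), 2))
```

Under these assumptions, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so one SEV that takes 30 minutes to resolve uses roughly 69% of the budget for that window.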

Measuring the success of your SEV program

Ensuring your team operates effectively during a SEV 0
Setting up IMOCs for success during SEV 0s
Empowering everyone in your company to record SEVs
SEV causes
Categorizing SEVs
Preventing SEVs from repeating
Using chaos engineering to empower your teams to prevent SEVs

Tammy Butow

Gremlin

Tammy Butow is a principal SRE at Gremlin, where she works on chaos engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using its control plane and API. Previously, Tammy led SRE teams at Dropbox responsible for the databases and storage systems used by over 500 million customers and was an IMOC (incident manager on call), where she was responsible for managing and resolving high-severity incidents across the company. She has also worked in infrastructure engineering, security engineering, and product engineering. Tammy is the cofounder of Girl Geek Academy, a global movement to teach one million women technical skills by 2025. Tammy is Australian and enjoys riding bikes, skateboarding, snowboarding, and surfing. She also loves mosh pits, crowd surfing, metal, and hardcore punk.

Comments

Tammy Butow | PRINCIPAL SITE RELIABILITY ENGINEER

06/11/2018 12:22pm PDT

Looking forward to seeing everyone at the workshop tomorrow! Please bring along a laptop. You won’t need to pre-install anything for this workshop.
