Engineer for the future of Cloud

June 10-13, 2019
San Jose, CA

Learning from failure: Why a total site outage can be a good thing

Alex Elman (Indeed)

11:35am–12:15pm Thursday, June 13, 2019

Production Engineering, SRE, and DevOps
Location: LL21 C/D

Level

Intermediate

Prerequisite knowledge

Involvement in incident response in a production environment, or experience carrying a pager for your organization's infrastructure (useful but not required)
An intermediate understanding of distributed systems architecture and resilient design patterns (useful but not required)

What you'll learn

Learn to embrace failure within your systems instead of trying to eliminate it
Gain actionable steps to bring back to your organization (e.g., how to improve the incident postmortem, how and when to implement common resilience patterns, and how to conduct effective resilience testing)

Description

Although an outage is a terrifying prospect, you should embrace it as an opportunity. Failure can expand and improve your understanding of your systems.

Three years ago, Indeed suffered one of the worst outages in its history. No single fault or failure caused this outage. Rather, it was a complex interaction of bugs, design decisions, capacity loss, and poor situational awareness during incident response. Indeed learned valuable lessons from this event. It identified ways to make the systems more resilient and improved the approach to the incident lifecycle within the engineering culture.

Alex Elman uses the narrative of this incident to demonstrate how a site-wide outage can inform increased resilience and reduced operational complexity. Learning from failure is a feedback loop rather than a one-off process, and he uses Indeed’s outage as a practical example of what one iteration of this loop can look like. He shares with other SREs the success that has arisen from this failure: Indeed hasn’t had a global site outage in the three years since this event.

Alex begins with a discussion of failure to set the stage for the incident background, then covers incident response and situational awareness. He explains how to conduct incident postmortems, learn from failure, and design for reliability, including resilience patterns such as circuit breaking and graceful degradation. Finally, he covers resilience testing, running chaos tests, and closing the feedback loop, leaving some time for a question-and-answer session.

Alex Elman

Indeed

Alex Elman is a site reliability engineer at Indeed. He’s studied and practiced resilience engineering at Indeed for seven years with the goal of making failure within distributed systems a boring nonevent. Even after moving into a leadership role, Alex continues to carry a pager, believing that incident response is a valuable learning opportunity.
