Sep 30–Oct 1, 2018: Training
Oct 1–3, 2018: Tutorials & Conference

New York, NY

SLO burn

Jamie Wilkinson (Google)

1:30pm–2:10pm Tuesday, October 2, 2018

Monitoring, Observability, and Performance
Location: Beekman/Sutton North
Level: Intermediate

Secondary topics: Systems Monitoring & Orchestration

Prerequisite knowledge

Familiarity with high school math and time series-based alerting (useful but not required)

What you'll learn

Learn how to implement sustainable SLO-based alerting for systems of any size

Description

As systems grow, they get more components—and more ways to fail. The alerts of the last system’s design can slowly “boil the frog,” and all of a sudden the SRE team finds they have no time left to address scaling problems because they’re constantly firefighting. Alert fatigue sets in, and the team burns out.

Naturally, maintenance work will always increase as the system itself grows. To keep alerting sustainable, page only on symptoms rather than causes, and even then only when a symptom crosses a declared threshold of what is acceptable. That threshold is the SLO, and its complement is the error budget.
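
As a rough illustration of how an SLO and its error budget relate, the Python sketch below computes the budget from an availability target and expresses an observed error ratio as a "burn rate" against that budget. The function names, the 99.9% target, the request counts, and the paging threshold are all assumptions chosen for illustration; they are not taken from the talk.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail: the complement of the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; above 1.0, the budget runs out before the SLO window ends.
    """
    if total <= 0:
        return 0.0  # no traffic observed, so nothing was burned
    return (errors / total) / error_budget(slo_target)


def should_page(errors: int, total: int, slo_target: float,
                threshold: float) -> bool:
    """Page only when the symptom (error ratio) burns budget too fast.

    `threshold` is a hypothetical tuning knob, not a recommended value.
    """
    return burn_rate(errors, total, slo_target) >= threshold


if __name__ == "__main__":
    # Assumed numbers: 50 failed requests out of 10,000 against a 99.9% SLO.
    print(burn_rate(50, 10_000, 0.999))
    print(should_page(50, 10_000, 0.999, threshold=2.0))
```

With these assumed numbers the error ratio is 0.5% against a 0.1% budget, so the budget is burning at roughly five times the sustainable rate and the hypothetical threshold of 2.0 would page.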

Even at Google scale, many teams have yet to implement the change in their monitoring to realize SLO-based alerts. But systems don’t need to be the size of a planet to benefit from these patterns.

Jamie Wilkinson offers a brief overview of SLOs and shares a practical guide to implementing sustainable SLO-based alerting for systems of any size. Whether you’re on call for 10 machines or 10 data centers, you’ll find something of value, as Jamie, a well-rested champion of work-life balance, demonstrates how to select service objectives and construct robust, low-maintenance alerting rules, using Prometheus for a live demonstration. You’ll also explore the tooling needed to keep such a system observable once noisy cause-based alerts are no longer telling you exactly which components are failing.

Jamie Wilkinson

Google

Jamie Wilkinson is a site reliability engineer at Google. He’s a contributing author to the SRE Book and has presented on contemporary topics at prominent conferences such as Linux.conf.au, Monitorama, PuppetConf, Velocity, and SRECon. His interests began in monitoring and the automation of small installations and have continued with human factors in automation and systems maintenance on large systems. Despite his more than 15 years in the industry, he’s still trying to automate himself out of a job.

©2018, O'Reilly Media, Inc.