Build Systems that Drive Business
30–31 Oct 2018: Training
31 Oct–2 Nov 2018: Tutorials & Conference
London, UK

Data-driven reliability

Yaniv Aknin (Google Cloud)
14:1014:50 Thursday, 1 November 2018
DevOps and SRE
Location: Blenheim Room - Palace Suite
Secondary topics:  Resilient, Performant & Secure Distributed Systems
Average rating: ****.
(4.71, 14 ratings)

What you'll learn

  • Learn how to apply metrics and data to architectural decision making


Requirements tell us what a system should do. Nonfunctional requirements tell us when the system is doing well. When put in a production context and measured against the live system, common nomenclature for these types of requirements are service-level indicators and objectives (SLIs and SLOs). If used correctly, architects will find these to be invaluable for continuous design, adding confidence to tough decision making around architectural changes (akin to tests/refactoring and profiling/optimization). Alas, SLIs and SLOs are deceptively hard to use correctly. All the nines in the world won’t help if our indicators measure the wrong thing.

Yaniv Aknin details what’s useful to measure and explains why you should focus on the user. You’ll discover common technical and organizational impediments to doing just that, compare passive monitoring (aka real user monitoring) to active monitoring (aka probers), and learn the importance of segmenting your users to business-meaningful cohorts, as well as a few cohorts to pay attention to.

With the right measurements in place, you can set objectives (SLOs) to help interpret the data. Yaniv outlines behaviors that you’d want your SLOs to encourage, primarily “working” and “fast” but sometimes also “correct,” “complete,” and “durable.” Yaniv also covers good choices for objectives, explaining why neither “perfect” nor “no SLO” are useful for engineering decisions (even if management really wants either). He concludes by exploring the practicalities of interpreting latency measurements (and how latency != performance), along with choosing aggregation windows or reporting periods. You’ll leave able to design SLIs and SLOs to guide architectural decisions for a big, live system—and make those nines work hard for your users.

Photo of Yaniv Aknin

Yaniv Aknin

Google Cloud

Yaniv Aknin is Google Cloud Platform’s lead for quantitative reliability. He works with product managers, developers, and fellow SREs to create availability and performance metrics that accurately model customers’ experience, then optimizes those metrics toward the right reliability/cost point. He’s been an SRE with Google since 2013, working on network infrastructure and several parts of the Google Cloud Platform. He has over two decades’ experience solving business problems in corporate, early startup, government, and nonprofit organizations. Outside of work, he enjoys travel, food, improv theater, and pop-sci, especially behavioral economics.