Engineering the Future of Software
29–31 Oct 2018: Tutorials & Conference
31 Oct–1 Nov 2018: Training
London, UK

Architecting for data-driven reliability

Yaniv Aknin (Google Cloud)
10:4512:15 Tuesday, 30 October 2018
Scale
Location: Blenheim Room - Palace Suite
Secondary topics:  Anti-Pattern, Best Practice
Average rating: ***..
(3.60, 5 ratings)

Who is this presentation for?

  • System architects, site reliability engineers, CIOs, and CTOs

Prerequisite knowledge

  • Experience building or running large systems (useful but not required)

What you'll learn

  • Learn how to apply metrics and data to architectural decision making

Description

Requirements tell us what a system should do. Nonfunctional requirements tell us when the system is doing well. Architects should pay attention to nonfunctional requirements because the challenges they pose require similar solutions: billion user systems often share architectural patterns even if they do different things. Availability, performance, and scale are all forms of functional requirements, with scale typically reduced to maintain availability/performance goals even at scale.

When put in a production context and measured against the live system, common nomenclature for these types of requirements are service-level indicators and objectives (SLIs and SLOs). If used correctly, architects will find these to be invaluable for continuous design, adding confidence to tough decision making around architectural changes (akin to tests/refactoring and profiling/optimization). Alas, SLIs and SLOs are deceptively hard to use correctly. All the nines in the world won’t help if our indicators measure the wrong thing.

Yaniv Aknin details what’s useful to measure and explains why you should focus on the user. You’ll discover common technical and organizational impediments to doing just that, compare passive monitoring (aka real user monitoring) to active monitoring (aka probers), and learn the importance of segmenting your users to business-meaningful cohorts, as well as a few cohorts to pay attention to.

With the right measurements in place, you can set objectives (SLOs) to help interpret the data. Yaniv outlines behaviors that you’d want your SLOs to encourage, primarily “working” and “fast” but sometimes also “correct,” “complete,” and “durable.” Yaniv also covers good choices for objectives, explaining why neither “perfect” nor “no SLO” are useful for engineering decisions (even if management really wants either). He concludes by exploring the practicalities of interpreting latency measurements (and how latency != performance), along with choosing aggregation windows or reporting periods. You’ll leave able to design SLIs and SLOs to guide architectural decisions for a big, live system. Knuth’s adage still applies at scale: “Premature optimization is the root of all evil.”

Photo of Yaniv Aknin

Yaniv Aknin

Google Cloud

Yaniv Aknin is Google Cloud Platform’s lead for quantitative reliability. He works with product managers, developers, and fellow SREs to create availability and performance metrics that accurately model customers’ experience, then optimizes those metrics toward the right reliability/cost point. He’s been an SRE with Google since 2013, working on network infrastructure and several parts of the Google Cloud Platform. He has over two decades’ experience solving business problems in corporate, early startup, government, and nonprofit organizations. Outside of work, he enjoys travel, food, improv theater, and pop-sci, especially behavioral economics.