Engineering the Future of Software
29–31 Oct 2018: Tutorials & Conference
31 Oct–1 Nov 2018: Training
London, UK

Architecting for data-driven reliability

Yaniv Aknin (Google Cloud)
10:45–12:15 Tuesday, 30 October 2018
Location: Blenheim Room - Palace Suite
Level: Intermediate
Secondary topics: Anti-Pattern, Best Practice
Average rating: 3.33 out of 5 (3 ratings)

Who is this presentation for?

  • System architects, site reliability engineers, CIOs, and CTOs

Prerequisite knowledge

  • Experience building or running large systems (useful but not required)

What you'll learn

  • How to apply metrics and data to architectural decision making


Functional requirements tell us what a system should do. Nonfunctional requirements tell us when the system is doing well. Architects should pay attention to nonfunctional requirements because the challenges they pose call for similar solutions: billion-user systems often share architectural patterns even if they do different things. Availability, performance, and scale are all forms of nonfunctional requirements, with scale typically reduced to maintaining availability and performance goals as the system grows.

When put in a production context and measured against the live system, these types of requirements are commonly called service-level indicators and objectives (SLIs and SLOs). Used correctly, they are invaluable to architects for continuous design, adding confidence to tough decisions about architectural changes (much as tests support refactoring and profiling supports optimization). Alas, SLIs and SLOs are deceptively hard to use correctly. All the nines in the world won’t help if our indicators measure the wrong thing.
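As a rough illustration of the distinction (not taken from the talk), an SLI can be treated as a measured ratio of good events to all events, and an SLO as a target that ratio is compared against. The names Slo, availability_sli, and meets_slo below are hypothetical, and the choice to treat a period with no traffic as healthy is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: a target fraction of good events per reporting period."""
    name: str
    target: float  # e.g. 0.999 means "99.9% of requests succeed"


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to all events in the period."""
    if total_events == 0:
        # Assumption: a period with no traffic is treated as healthy.
        return 1.0
    return good_events / total_events


def meets_slo(slo: Slo, good_events: int, total_events: int) -> bool:
    """True when the measured indicator is at or above the objective."""
    return availability_sli(good_events, total_events) >= slo.target


if __name__ == "__main__":
    three_nines = Slo(name="frontend-availability", target=0.999)
    print(availability_sli(999_500, 1_000_000))
    print(meets_slo(three_nines, 999_500, 1_000_000))
```

The indicator is only as meaningful as the definition of a "good event": if the counters are taken somewhere users never see, the comparison can pass while users are failing.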

Yaniv Aknin details what’s useful to measure and explains why you should focus on the user. You’ll discover common technical and organizational impediments to doing just that, compare passive monitoring (aka real user monitoring) with active monitoring (aka probers), and learn why it matters to segment your users into business-meaningful cohorts, including a few cohorts that deserve particular attention.
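One way to see why segmentation matters is to compute the same indicator overall and per cohort. The sketch below is an assumption-laden illustration, not the speaker's method: Request, sli_by_cohort, overall_sli, and the cohort labels "free" and "enterprise" are all hypothetical, and the traffic is made-up sample data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Request(NamedTuple):
    cohort: str  # hypothetical segmentation key, e.g. pricing tier or region
    ok: bool     # whether the user received a successful response


def overall_sli(requests: Iterable[Request]) -> float:
    """Success ratio across all requests, ignoring cohorts."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests to measure")
    return sum(r.ok for r in reqs) / len(reqs)


def sli_by_cohort(requests: Iterable[Request]) -> Dict[str, float]:
    """Success ratio computed separately for each cohort."""
    good: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in requests:
        total[r.cohort] += 1
        good[r.cohort] += int(r.ok)
    return {cohort: good[cohort] / total[cohort] for cohort in total}


def sample_traffic() -> List[Request]:
    """Made-up data: a large healthy cohort and a small unhealthy one."""
    free = [Request("free", True)] * 9895 + [Request("free", False)] * 5
    enterprise = [Request("enterprise", True)] * 80 + [Request("enterprise", False)] * 20
    return free + enterprise


if __name__ == "__main__":
    traffic = sample_traffic()
    print(overall_sli(traffic))
    print(sli_by_cohort(traffic))
```

With this sample data the overall ratio is 0.9975, while the small "enterprise" cohort sits at 0.8: the aggregate hides a cohort whose experience is far worse.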

With the right measurements in place, you can set objectives (SLOs) to help interpret the data. Yaniv outlines behaviors that you’d want your SLOs to encourage, primarily “working” and “fast” but sometimes also “correct,” “complete,” and “durable.” Yaniv also covers good choices for objectives, explaining why neither “perfect” nor “no SLO” is useful for engineering decisions (even if management really wants either). He concludes by exploring the practicalities of interpreting latency measurements (and how latency != performance), along with choosing aggregation windows or reporting periods. You’ll leave able to design SLIs and SLOs to guide architectural decisions for a big, live system. Knuth’s adage still applies at scale: “Premature optimization is the root of all evil.”
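To make the point about interpreting latency concrete, here is a minimal sketch comparing a mean with nearest-rank percentiles and with a threshold-based "fast" indicator. The function names, the 300 ms threshold, and the sample latencies are assumptions chosen for illustration; the talk itself may use different definitions.

```python
import math
import statistics
from typing import Sequence


def percentile(latencies_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """A 'fast' SLI: the share of requests served at or under the threshold."""
    if not latencies_ms:
        raise ValueError("no samples")
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


# Made-up sample: most requests are quick, a few are very slow.
SAMPLE_MS = [100] * 98 + [5000] * 2

if __name__ == "__main__":
    print(statistics.mean(SAMPLE_MS))
    print(percentile(SAMPLE_MS, 50), percentile(SAMPLE_MS, 99))
    print(fraction_within(SAMPLE_MS, 300))
```

For this sample the mean is 198 ms, the median is 100 ms, and the 99th percentile is 5000 ms, so the mean describes almost no real request. The threshold form (0.98 of requests within 300 ms) has the same good-over-total shape as an availability indicator, which makes it straightforward to compare against an objective over whatever reporting period is chosen.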


Yaniv Aknin

Google Cloud

Yaniv Aknin is the SRE tech lead for Google’s Cloud Services Group, covering products like App Engine, Kubernetes Engine, Cloud Functions, API infrastructure, and others. Yaniv is passionate about reliability metrics as a tool to keep SRE groups focused on lasting engineering projects and away from tactical operational overload. Outside of work, he enjoys travel, food, improv theater, and popsci, especially behavioral economics.
