Architecting for data-driven reliability

Yaniv Aknin (Google Cloud)

10:45–12:15 Tuesday, 30 October 2018

Scale
Location: Blenheim Room - Palace Suite

Secondary topics: Anti-Pattern, Best Practice


Who is this presentation for?

System architects, site reliability engineers, CIOs, and CTOs

Prerequisite knowledge

Experience building or running large systems (useful but not required)

What you'll learn

Learn how to apply metrics and data to architectural decision making

Description

Functional requirements tell us what a system should do. Nonfunctional requirements tell us when the system is doing well. Architects should pay attention to nonfunctional requirements because the challenges they pose call for similar solutions: billion-user systems often share architectural patterns even if they do different things. Availability, performance, and scale are all forms of nonfunctional requirements, with scale typically reduced to maintaining availability and performance goals as the system grows.

When put in a production context and measured against the live system, these types of requirements are commonly called service-level indicators and objectives (SLIs and SLOs). Used correctly, they are invaluable to architects for continuous design, adding confidence to tough decisions about architectural changes (much as tests support refactoring and profiling supports optimization). Alas, SLIs and SLOs are deceptively hard to use correctly. All the nines in the world won’t help if our indicators measure the wrong thing.
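
As a rough illustration of the distinction, an availability SLI can be treated as the ratio of good events to all events, and an SLO as a target that ratio is compared against. The sketch below is not taken from the talk; the function names and the 99.9% target are assumptions chosen for the example.

```python
def availability_sli(good_events, total_events):
    """Return the fraction of events that were good, or None with no traffic."""
    if total_events == 0:
        return None  # no traffic in the period: the indicator is undefined
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """Return True if the measured SLI is at or above the target.

    slo_target is a fraction such as 0.999 (an assumed example value).
    A period with no traffic is treated as meeting the objective.
    """
    sli = availability_sli(good_events, total_events)
    return sli is None or sli >= slo_target


# Hypothetical month: one million requests, 500 of them failed.
print(availability_sli(999_500, 1_000_000))
print(meets_slo(999_500, 1_000_000, slo_target=0.999))
```

The arithmetic is trivial; the hard part the abstract points at is deciding what counts as a "good event" from the user's point of view.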

Yaniv Aknin details what’s useful to measure and explains why you should focus on the user. You’ll discover common technical and organizational impediments to doing just that, compare passive monitoring (aka real user monitoring) with active monitoring (aka probers), and learn why segmenting your users into business-meaningful cohorts matters, along with a few cohorts worth paying attention to.
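
To see why segmentation matters, consider a minimal sketch in which a single global availability number hides a badly served cohort. The cohort names and traffic figures here are invented for illustration and do not come from the talk.

```python
from collections import defaultdict


def overall_availability(events):
    """events: list of (cohort, ok) pairs. Returns the global success fraction."""
    return sum(1 for _, ok in events if ok) / len(events)


def availability_by_cohort(events):
    """Return {cohort: success fraction} for a list of (cohort, ok) pairs."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for cohort, ok in events:
        totals[cohort] += 1
        if ok:
            good[cohort] += 1
    return {cohort: good[cohort] / totals[cohort] for cohort in totals}


# Hypothetical traffic: a large cohort that is fine, a small one that is not.
events = (
    [("free", True)] * 990
    + [("enterprise", True)] * 5
    + [("enterprise", False)] * 5
)

print(overall_availability(events))    # the global number looks healthy
print(availability_by_cohort(events))  # the per-cohort view shows who is hurting
```

In this made-up data the global figure is 99.5%, while half of the small cohort's requests fail, which is the kind of gap a per-cohort indicator is meant to expose.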

With the right measurements in place, you can set objectives (SLOs) to help interpret the data. Yaniv outlines behaviors that you’d want your SLOs to encourage, primarily “working” and “fast” but sometimes also “correct,” “complete,” and “durable.” Yaniv also covers good choices for objectives, explaining why neither “perfect” nor “no SLO” is useful for engineering decisions (even if management really wants one or the other). He concludes by exploring the practicalities of interpreting latency measurements (and how latency != performance), along with choosing aggregation windows or reporting periods. You’ll leave able to design SLIs and SLOs to guide architectural decisions for a big, live system. Knuth’s adage still applies at scale: “Premature optimization is the root of all evil.”
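
The points about interpreting latency and choosing aggregation windows can be sketched as follows. This is an illustrative example under stated assumptions, not the speaker's method: it uses a nearest-rank percentile, fixed non-overlapping windows, and invented sample data.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile of values, for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def windowed_percentile(samples, window_seconds, p):
    """samples: list of (timestamp_seconds, latency_ms) pairs.

    Groups samples into fixed windows and returns
    {window_start_seconds: p-th percentile latency}.
    """
    buckets = {}
    for ts, latency in samples:
        start = int(ts // window_seconds) * window_seconds
        buckets.setdefault(start, []).append(latency)
    return {start: percentile(vals, p) for start, vals in sorted(buckets.items())}


# Hypothetical latencies: most requests are fast, a few are very slow.
latencies = [100] * 95 + [2000] * 5
print(mean(latencies), percentile(latencies, 50), percentile(latencies, 99))

# The same samples summarized with two different window sizes.
samples = [(0, 100), (30, 200), (60, 300), (90, 5000)]
print(windowed_percentile(samples, window_seconds=60, p=100))
print(windowed_percentile(samples, window_seconds=120, p=100))
```

With these made-up numbers the mean matches no request any user actually experienced, and the window size decides whether the slow request shows up as one bad minute or colors the whole two-minute period.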

Yaniv Aknin

Google Cloud

Yaniv Aknin is Google Cloud Platform’s lead for quantitative reliability. He works with product managers, developers, and fellow SREs to create availability and performance metrics that accurately model customers’ experience, then optimizes those metrics toward the right reliability/cost point. He’s been an SRE with Google since 2013, working on network infrastructure and several parts of the Google Cloud Platform. He has over two decades’ experience solving business problems in corporate, early startup, government, and nonprofit organizations. Outside of work, he enjoys travel, food, improv theater, and pop-sci, especially behavioral economics.
