Engineering the Future of Software
29–31 Oct 2018: Tutorials & Conference
31 Oct–1 Nov 2018: Training
London, UK

Architecting for Data-Driven Reliability

Yaniv Aknin (Google Cloud)
10:45–12:15 Tuesday, 30 October 2018
Scale
Location: Blenheim Room - Palace Suite
Level: Intermediate
Secondary topics: Anti-Pattern, Best Practice

Who is this presentation for?

System Architects, Site Reliability Engineers, CIO/CTO

Prerequisite knowledge

No specific prerequisites, but the talk should resonate better with people who've built or run large systems. If you've ever felt the challenge of knowing "what are my users thinking", this talk may be for you.

What you'll learn

How to apply metrics and data to architectural decision making. The talk won't discuss specific tools, but it will be quite practical in the sense that the audience should walk away with concrete ideas for new measurements and reports they can use in their work.

Description

Functional requirements tell us what a system should do. Non-functional requirements tell us when the system is doing well. Architects should pay attention to non-functional requirements because the challenges they pose require similar solutions: billion-user systems often share architectural patterns even if they do different things. Availability, performance and scale are all forms of non-functional requirements, with scale typically reduced to “maintain availability/performance goals even at scale”.

When these types of requirements are put in a production context and measured against the live system, the common nomenclature for them is Service Level Indicators/Objectives (SLIs/SLOs). Used correctly, they are invaluable to architects for continuous design, adding confidence to tough decisions about architectural changes (much as tests do for refactoring and profiling does for optimisation). Alas, SLIs/SLOs are deceptively hard to use correctly.

All the nines in the world won’t help if our indicators (SLIs) measure the wrong thing. So we’ll start by talking about what’s useful to measure (with some good and bad examples). We’ll discuss focusing on the user and the common technical and organisational impediments to doing that. We’ll compare passive monitoring (aka real user monitoring) to active monitoring (aka probers). And we’ll round off by covering the importance of segmenting your users into business-meaningful cohorts, recommending a few cohorts to pay attention to.
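
To illustrate why segmenting matters, here is a minimal sketch, not taken from the talk, of an availability SLI computed per cohort. The cohort names ("free", "paid"), the request counts and the function name are hypothetical assumptions chosen only to show how an overall figure can hide a badly served cohort.

```python
from collections import defaultdict


def availability_by_cohort(requests):
    """Return {cohort: fraction of successful requests}.

    `requests` is an iterable of (cohort, ok) pairs, where `ok` is True
    when the request succeeded from the user's point of view.
    """
    totals = defaultdict(int)
    good = defaultdict(int)
    for cohort, ok in requests:
        totals[cohort] += 1
        if ok:
            good[cohort] += 1
    return {cohort: good[cohort] / totals[cohort] for cohort in totals}


# Hypothetical traffic: a large, healthy cohort and a small, unhealthy one.
requests = (
    [("free", True)] * 990
    + [("free", False)] * 10
    + [("paid", True)] * 8
    + [("paid", False)] * 2
)

# Aggregate SLI across all users: dominated by the large cohort.
overall = sum(1 for _, ok in requests if ok) / len(requests)

# Per-cohort SLI: exposes the small cohort's much lower availability.
per_cohort = availability_by_cohort(requests)
```

In this made-up data the overall availability is just under 99%, while the small cohort sees only 80%, which is the kind of gap an unsegmented indicator would not reveal.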

With the right measurements in place, we can set objectives (SLOs) to help us interpret the data. We’ll discuss behaviours that you’d want your SLOs to encourage, primarily working and fast but sometimes also correct, complete or durable. We’ll also cover how to choose good objectives, explaining why neither “perfect” nor “no SLO” is useful for engineering decisions (even if management really wants either). Lastly, we’ll spend some time on the practicalities of interpreting latency measurements (and how latency != performance) and on the choice of aggregation windows and reporting periods.
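
As a rough illustration of these ideas (a sketch under stated assumptions, not the speaker's method), the snippet below expresses a latency SLI as the fraction of requests served within a threshold and compares it with an SLO via the remaining error budget. The 300 ms threshold, the 99% target, the sample latencies and both function names are hypothetical.

```python
import statistics


def fast_fraction(latencies_ms, threshold_ms):
    """Latency SLI: fraction of requests served within the threshold."""
    return sum(1 for l in latencies_ms if l <= threshold_ms) / len(latencies_ms)


def error_budget_remaining(sli, slo):
    """Fraction of the error budget left; a negative value means the SLO is breached.

    An SLO of exactly 1.0 ("perfect") leaves no budget to reason about,
    so it is rejected here.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_bad = 1.0 - slo
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: most requests are fast, a few are very slow.
latencies_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(latencies_ms)      # pulled up by the slow tail
median_ms = statistics.median(latencies_ms)  # ignores the slow tail entirely
sli = fast_fraction(latencies_ms, threshold_ms=300)
budget = error_budget_remaining(sli, slo=0.99)
```

With this sample the mean is 345 ms and the median 100 ms, neither of which says directly that 5% of requests were slow; the threshold-based SLI does, and the negative budget shows the assumed 99% objective was missed for the window.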

At the end of this talk, you’ll be able to design SLIs and SLOs to guide architectural decisions for a big, live system. Knuth’s adage still applies at scale: “premature optimisation is the root of all evil”.

Yaniv Aknin

Google Cloud

Yaniv Aknin is the SRE tech lead for Google’s Cloud Services group, covering products like App Engine, Kubernetes Engine, Cloud Functions, API Infrastructure and others.

Yaniv is passionate about using reliability metrics as a tool to keep SRE groups focused on lasting engineering projects and away from tactical operational overload.

Outside of work he enjoys travelling, improv theatre and popsci, especially behavioural economics.
