Requirements tell us what a system should do. Non-functional requirements tell us when the system is doing it well. Architects should pay attention to non-functional requirements because the challenges they pose call for similar solutions: billion-user systems often share architectural patterns even when they do very different things. Availability, performance and scale are all forms of non-functional requirements, with scale typically reduced to “maintain availability/performance goals even at scale”.
When put in a production context and measured against the live system, the common nomenclature for these requirements is Service Level Indicators and Service Level Objectives (SLIs/SLOs). Used correctly, they are invaluable for continuous design, adding confidence to tough decision making around architectural changes (much as tests support refactoring and profiling supports optimisation). Alas, SLIs/SLOs are deceptively hard to use correctly.
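To make the terms concrete, here is a minimal sketch of an availability SLI, an SLO, and the error budget that connects them. The function names and all the numbers are illustrative, not from the talk:

```python
# Minimal sketch of an availability SLI, SLO, and error budget.
# All names and numbers here are illustrative.

def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of requests served successfully."""
    return good_events / total_events

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent in this period."""
    allowed_failure = 1 - slo   # e.g. 0.1% failures allowed for a 99.9% SLO
    actual_failure = 1 - sli
    return 1 - actual_failure / allowed_failure

slo = 0.999                     # "three nines"
sli = availability_sli(good_events=999_500, total_events=1_000_000)
print(f"SLI: {sli:.4%}")                                       # 99.9500%
print(f"Budget left: {error_budget_remaining(sli, slo):.0%}")  # 50%
```

The remaining error budget is the kind of number that supports tough calls: with half the budget left, a risky architectural change may be affordable; with the budget exhausted, it probably is not.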
All the nines in the world won’t help if our indicators (SLIs) measure the wrong thing. So we’ll start by talking about what’s useful to measure (with some good and bad examples). We’ll discuss focusing on the user, and common technical and organisational impediments to doing that. We’ll compare passive monitoring (aka real user monitoring) with active monitoring (aka probers). And we’ll round off by covering the importance of segmenting your users into business-meaningful cohorts, recommending a few cohorts to pay attention to.
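The point of cohorts is that one aggregate number can hide a badly-served user segment. As a sketch (the cohort names and request records below are made up), the same availability SLI can be computed per cohort:

```python
# Sketch: the same availability SLI, segmented by business-meaningful cohort.
# Cohort names and request records are invented for illustration.
from collections import defaultdict

requests = [
    {"cohort": "free_tier",  "ok": True},
    {"cohort": "free_tier",  "ok": False},
    {"cohort": "paying",     "ok": True},
    {"cohort": "paying",     "ok": True},
    {"cohort": "enterprise", "ok": True},
]

good = defaultdict(int)
total = defaultdict(int)
for r in requests:
    total[r["cohort"]] += 1
    good[r["cohort"]] += r["ok"]   # True counts as 1, False as 0

for cohort in sorted(total):
    print(f"{cohort}: {good[cohort] / total[cohort]:.0%} available")
# enterprise: 100% available
# free_tier: 50% available
# paying: 100% available
```

Here the overall SLI is 80%, yet paying users see no failures at all and free-tier users see one request in two fail; which of those matters is a business question, not a monitoring one.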
With the right measurements in place, we can set objectives (SLOs) to help us interpret the data. We’ll discuss the behaviours you’d want your SLOs to encourage, primarily working and fast, but sometimes also correct, complete or durable. We’ll also cover good choices of objectives, explaining why neither “perfect” nor “no SLO” is useful for engineering decisions (even if management really wants one or the other). Lastly, we’ll spend some time on the practicalities of interpreting latency measurements (and how latency != performance) and on the choice of aggregation windows and reporting periods.
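A quick sketch of why aggregation windows matter: averaging per-window latency percentiles is not the same as computing the percentile over the whole period, and the gap is largest when slowness comes in bursts. The traffic shape and window size below are invented for illustration:

```python
# Sketch: averaging per-window p99s understates bursty latency.
# Traffic shape and window size are invented for illustration.
import random

random.seed(0)
# Mostly fast requests (~100 ms), plus one burst of slow ones (~500 ms)
# concentrated at the end of the period.
latencies = [random.gauss(100, 10) for _ in range(9_000)] + \
            [random.gauss(500, 50) for _ in range(1_000)]

def percentile(values, p):
    s = sorted(values)
    return s[int(p / 100 * (len(s) - 1))]

# p99 over the whole period:
overall_p99 = percentile(latencies, 99)

# p99 per 1000-request window, then averaged -- a common but
# misleading aggregation:
windows = [latencies[i:i + 1000] for i in range(0, len(latencies), 1000)]
avg_of_window_p99s = sum(percentile(w, 99) for w in windows) / len(windows)

print(f"overall p99:         {overall_p99:.0f} ms")
print(f"mean of window p99s: {avg_of_window_p99s:.0f} ms")
```

The averaged per-window figure comes out far lower than the whole-period p99, because nine quiet windows dilute the one bad one; an SLO judged on the former can look green while one in a hundred requests over the period was an order of magnitude slower than typical.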
At the end of this talk, you’ll be able to design SLIs and SLOs to guide architectural decisions for a big, live system. Knuth’s adage still applies at scale: “premature optimisation is the root of all evil”.
Yaniv Aknin is the SRE tech lead for Google’s Cloud Services group, covering products like App Engine, Kubernetes Engine, Cloud Functions, API Infrastructure and others.
Yaniv is passionate about using reliability metrics as a tool to keep SRE groups focused on lasting engineering projects and away from tactical operational overload.
Outside of work he enjoys travelling, improv theatre and popsci, especially behavioural economics.
©2018, O’Reilly UK Ltd