SRE classroom: How to design a reliable application in three hours (sponsored by Google)





Who is this presentation for?
- Site reliability engineers, engineering managers, and technical program managers
Description
Outline:
Goals and expectations
- We have a problem. Let’s solve it with software
- The initial problem statement
- The service-level objective (SLO)
- Terminology and concepts
  - Hardware (memory, processor)
  - Software (libraries, invariants)
- Distributed
  - Hardware (data center, network)
  - Software (algorithms, failures)
- What is distributed consensus? Why is it important?
- Hands-on exercise: Identify the components necessary to build a working system in a single location and produce a sketch of this working system (UML not required)
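As an illustration of the service-level objective item in this part of the outline, here is a minimal sketch (not taken from the workshop materials) of how an availability SLO translates into an error budget. The function name, the 99.9% target, and the 30-day window are assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of allowed unavailability for an availability SLO.

    `slo` is a fraction such as 0.999; `window_days` is the measurement
    window. Both values used below are illustrative assumptions.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


# A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")
```

A single-location system has to absorb every hardware failure, software rollout, and maintenance event within that budget, which is one way to motivate the replication discussion that follows.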
The solution has limitations: Let’s improve it
- We have identified single points of failure…because things failed. The system failed. And we lost users.
- Let’s replicate this thing.
- What parts are useful to duplicate? Replicate? How do we arrange this so that we make the computers do all the work?
- How do we know that these systems are doing what we expect?
- We have performance bottlenecks
- How do we identify bottlenecks?
- Conversely, how do we know that we have removed these bottlenecks?
- How can we apply these concepts to a real piece of software?
- What limitations does this introduce?
- Hands-on exercise: Identify which components can usefully run in multiple locations; evaluate how to write an SLO (and how to apply it) and produce a system that runs in multiple data centers
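The replication questions above can be explored with a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not the workshop's method: it treats replica failures as fully independent, which real systems rarely achieve because replicas often share networks, power, or software bugs.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one replica is up.

    Assumes each replica is available with probability `single` and that
    failures are independent (a simplifying assumption for illustration).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be in the range [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


# Compare one, two, and three replicas of a 99%-available component.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under the independence assumption, each added replica multiplies the failure probability by the single-replica failure rate. Correlated failures erode that gain, which is part of what the "What limitations does this introduce?" item is asking about.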
Discussion and conclusions
- Present an example solution
- Discuss commonly encountered limitations
- What key points have we learned?
- How does it apply beyond this workshop?
- Assessing and evaluating third-party (i.e., cloud) systems and integrating them into your design
- Hands-on exercises
For each exercise, you’ll work in small groups to apply the concepts to the problem. As Jesus Climent discusses additional aspects of distributed-systems design, you’ll apply these concepts to your in-progress solutions.
This tutorial is sponsored by Google.
Prerequisite knowledge
Suggested readings: Susan J. Fowler's Microservices in Production
SRE Book
- Service Level Objectives
- Load Balancing at the Frontend
- Load Balancing in the Datacenter
- Managing Critical State: Distributed Consensus for Reliability
- Data Integrity: What You Read Is What You Wrote
Distributed systems in production environments
- The Google File System
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Borg
- Familiarity with order-of-magnitude comparisons
What you'll learn
- Learn how to evaluate distributed systems using quantitative analysis, incrementally improve a system, identify single points of failure in a large software system, and make the resource estimates required to create a bill of materials
- See how an SLO fits into system design
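As a rough illustration of the resource-estimation goal, the following sketch sizes a serving fleet from a peak load figure. Everything in it is hypothetical: the function name, the per-machine throughput, and the choice of two spare machines (one for planned maintenance, one for an unplanned failure) are assumptions for the example, not figures from the tutorial.

```python
import math


def machines_needed(peak_qps: float, qps_per_machine: float, spare: int = 2) -> int:
    """Estimate machines required to serve peak load plus spare capacity.

    `spare` extra machines cover maintenance and unexpected failures;
    the default of 2 is an illustrative assumption.
    """
    if peak_qps < 0:
        raise ValueError("peak_qps must be non-negative")
    if qps_per_machine <= 0:
        raise ValueError("qps_per_machine must be positive")
    if spare < 0:
        raise ValueError("spare must be non-negative")
    return math.ceil(peak_qps / qps_per_machine) + spare


# 10,000 queries per second at 1,500 per machine, plus two spares.
print(machines_needed(10_000, 1_500))
```

Repeating the same estimate for storage, memory, and network bandwidth, per location, gives the line items for a bill of materials.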

Jesus Climent
Jesus Climent is a senior site reliability engineer on the CRE team at Google, where he helps companies meet their reliability requirements. Previously, he spent eight years at Nokia as a system administrator and architecture engineer.

Akshay Kumar
Akshay Kumar is a senior SRE on Cloud Bigtable, Google’s petabyte-scale NoSQL database. Before this, he worked as an engineer on Google Search and as an options trader at JPMorgan. He enjoys learning new things and scaling himself sub-linearly.
Comments
Really appreciate it, thanks for your effort.
Hi 陈林平 and Ky Anh Huynh,
We are working with our legal team to make these slides available as soon as possible.
Thanks for your interest!
Are the slides still not available?
Hi, Jesus Climent and Akshay Kumar. Are your slides available to download? Thanks a lot.