SRE classroom: How to design a reliable application in three hours (sponsored by Google)
Who is this presentation for?
- Site reliability engineers, engineering managers, and technical program managers
Goals and expectations
- We have a problem. Let’s solve it with software
- The initial problem statement
- The service-level objective (SLO)
- Terminology and concepts
- Hardware (memory, processor)
- Software (libraries, invariants)
- Hardware (data center, network)
- Software (algorithms, failures)
- What is distributed consensus? Why is it important?
- Hands-on exercise: Identify the components necessary to build a working system in a single location and produce a sketch of this working system (UML not required)
The solution has limitations: Let’s improve it
- We have identified single points of failure…because things failed. The system failed. And we lost users.
- Let’s replicate this thing.
- What parts are useful to duplicate? Replicate? How do we arrange this so that we make the computers do all the work?
- How do we know that these systems are doing what we expect?
- We have performance bottlenecks
- How do we identify bottlenecks?
- Conversely, how do we know that we have removed these bottlenecks?
- How can we apply these concepts to a real piece of software?
- What limitations does this introduce?
- Hands-on exercise: Identify which components can usefully run in multiple locations; evaluate how to write an SLO (and how to apply it) and produce a system that runs in multiple data centers
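As a rough illustration of how an SLO can be turned into numbers when weighing a multi-location design, here is a minimal sketch. The function names, the 30-day window, and the example figures are illustrative assumptions and not workshop material. The replication formula also assumes replica failures are fully independent, which real systems rarely achieve.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) an availability SLO permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def combined_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` copies can serve traffic.

    Assumption: failures are independent, so the system is down only
    when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% SLO and replicas that are each 99% available.
    print(f"{error_budget_minutes(0.999):.1f} minutes of budget per 30 days")
    print(f"{combined_availability(0.99, 2):.4f} availability with two replicas")
```

Shared dependencies such as a common network, configuration push, or software release break the independence assumption, which is one reason the measured availability should be compared against the estimate.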
Discussion and conclusions
- Present an example solution
- Discuss commonly encountered limitations
- What key points have we learned?
- How does it apply beyond this workshop?
- Assessing and evaluating third-party (i.e., cloud) systems and integrating these into your design
Hands-on exercises
For each exercise, you’ll work in small groups to apply the concepts to the problem. As Jesus Climent discusses additional aspects of distributed-systems design, you’ll apply these concepts to your in-progress solutions.
This tutorial is sponsored by Google.
Prerequisite knowledge
Suggested readings:
- Service Level Objectives
- Load Balancing at the Frontend
- Load Balancing in the Datacenter
- Managing Critical State: Distributed Consensus for Reliability
- Data Integrity: What You Read Is What You Wrote
Distributed systems in production environments
- The Google File System
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Familiarity with order-of-magnitude comparisons
What you'll learn
- Learn how to evaluate distributed systems using techniques of quantitative analysis, incrementally improve a system, identify single points of failure in a large software system, and make the resource estimations required to create a bill of materials
- See how an SLO fits into system design and be able to incrementally improve a system
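The kind of order-of-magnitude resource estimation mentioned above can be sketched as follows. This is a back-of-the-envelope example only: the function names, the spare-capacity policy, and every number in it are hypothetical assumptions chosen for illustration, not figures from the tutorial.

```python
import math


def servers_needed(peak_qps: int, qps_per_server: int, spare: int = 1) -> int:
    """Servers required to carry peak load, plus `spare` extra machines.

    Assumption: load spreads evenly and each server sustains
    `qps_per_server` queries per second.
    """
    return math.ceil(peak_qps / qps_per_server) + spare


def storage_bytes(items: int, bytes_per_item: int, replicas: int) -> int:
    """Raw storage needed to hold every item `replicas` times."""
    return items * bytes_per_item * replicas


if __name__ == "__main__":
    # Hypothetical workload: 10,000 QPS at peak, 800 QPS per server,
    # two spare machines, and one billion 4 KiB items stored three times.
    bill_of_materials = {
        "frontend_servers": servers_needed(10_000, 800, spare=2),
        "storage_tb": storage_bytes(10**9, 4096, 3) / 1e12,
    }
    print(bill_of_materials)
```

Rounding up and adding spares separately keeps the two decisions visible: how much capacity the load demands, and how many failures the design is meant to tolerate.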
Jesus Climent is a senior site reliability engineer on the CRE team at Google, where he helps companies meet their reliability requirements. Previously, he spent eight years at Nokia as a system administrator and architecture engineer.
Akshay is a senior SRE on Cloud Bigtable, Google’s petabyte-scale NoSQL database. Before this, he worked as an engineer on Google Search and as an options trader at JPMorgan. He enjoys learning new things and scaling himself sub-linearly.