4–7 Nov 2019

SRE classroom: How to design a reliable application in three hours (sponsored by Google)

Jesus Climent (Google), Akshay Kumar (Google)
13:30–17:00 Tuesday, 5 November 2019
Location: Hall A6
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Site reliability engineers, engineering managers, and technical program managers





Goals and expectations

  • We have a problem. Let’s solve it with software
  • The initial problem statement
  • The service-level objective (SLO)
  • Terminology and concepts
  • Hardware (memory, processor)
  • Software (libraries, invariants)
  • Distributed
  • Hardware (data center, network)
  • Software (algorithms, failures)
  • What is distributed consensus? Why is it important?
  • Hands-on exercise: Identify the components necessary to build a working system in a single location and produce a sketch of this working system (UML not required)

The solution has limitations: Let’s improve it

  • We have identified single points of failure…because things failed. The system failed. And we lost users.
  • Let’s replicate this thing.
  • What parts are useful to duplicate? Replicate? How do we arrange this so that we make the computers do all the work?
  • How do we know that these systems are doing what we expect?
  • We have performance bottlenecks
  • How do we identify bottlenecks?
  • Conversely, how do we know that we have removed these bottlenecks?
  • How can we apply these concepts to a real piece of software?
  • What limitations does this introduce?
  • Hands-on exercise: Identify which components can usefully run in multiple locations; evaluate how to write an SLO (and how to apply it) and produce a system that runs in multiple data centers
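As a rough illustration of the arithmetic behind this exercise, the sketch below converts an availability SLO into an error budget and estimates the availability gained by replicating a component. The function names and the sample figures are hypothetical and are not taken from the workshop. The replication formula assumes that replicas fail independently, which is optimistic for real systems that share networks, power, or software releases.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def replicated_availability(single: float, replicas: int) -> float:
    """Availability when the service is up if at least one replica is up.

    Assumes replica failures are independent (a simplifying assumption).
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Hypothetical SLO targets, chosen only for illustration.
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} min per 30 days")
    # Hypothetical single-replica availability of 99%.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {replicated_availability(0.99, n):.6f}")
```

Comparing the estimated availability against the SLO is one way to judge whether running a component in another location is worth its cost.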

Discussion and conclusions

  • Present an example solution
  • Discuss commonly encountered limitations
  • What key points have we learned?
  • How does it apply beyond this workshop?
  • Assessing and evaluating third-party (i.e., cloud) systems and integrating these into your design

Hands-on exercises

For each exercise, you’ll work in small groups to apply the concepts to the problem. As Jesus Climent discusses additional aspects of distributed-systems design, you’ll apply these concepts to your in-progress solutions.

This tutorial is sponsored by Google.

Prerequisite knowledge

Suggested readings:

Susan J. Fowler's Microservices in Production

SRE Book

  • Service Level Objectives
  • Load Balancing at the Frontend
  • Load Balancing in the Datacenter
  • Managing Critical State: Distributed Consensus for Reliability
  • Data Integrity: What You Read Is What You Wrote

The Warehouse Datacenter

Distributed systems in production environments

  • The Google File System
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems
  • Borg

CAP Twelve Years Later

Skills and Tools
  • Familiarity with order-of-magnitude comparisons

What you'll learn

  • Learn how to evaluate distributed systems using quantitative analysis, identify single points of failure in a large software system, incrementally improve a system, and make the resource estimates required to create a bill of materials
  • See how an SLO fits into system design
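
A resource estimate of this kind can be sketched as a back-of-the-envelope calculation. The example below is a minimal sketch with hypothetical function names and made-up workload figures. It is not the method taught in the session, and the spare-server count and replication factor are assumptions chosen only for illustration.

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float, spare: int = 2) -> int:
    """Servers required to carry peak load, plus spares for failures and maintenance."""
    return math.ceil(peak_qps / qps_per_server) + spare


def storage_tb(items: int, bytes_per_item: int, replication: int = 3) -> float:
    """Raw storage in terabytes (10**12 bytes), counting every replica."""
    return items * bytes_per_item * replication / 1e12


if __name__ == "__main__":
    # Hypothetical workload: 10,000 queries per second at peak, 300 per server.
    print("servers:", servers_needed(peak_qps=10_000, qps_per_server=300))
    # Hypothetical data set: one billion items of 2 MB each, stored three times.
    print("storage (TB):", storage_tb(items=1_000_000_000, bytes_per_item=2_000_000))
```

Order-of-magnitude inputs are usually enough here: the aim is to learn whether a design needs tens or thousands of machines, not an exact count.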

Jesus Climent


Jesus Climent is a senior site reliability engineer on the CRE team at Google, where he helps companies meet their reliability requirements. Previously, he spent eight years at Nokia as a system administrator and architecture engineer.


Akshay Kumar


Akshay Kumar is a senior SRE on Cloud Bigtable, Google’s petabyte-scale NoSQL database. Before this, he worked as an engineer on Google Search and as an options trader at JPMorgan. He enjoys learning new things and scaling himself sub-linearly.

Comments on this page are now closed.


陈林平 | Staff Engineer
20/11/2019 9:46 CET

Really appreciate it, thanks for your effort.

Jesus Climent | Senior Systems Engineer
18/11/2019 17:34 CET

Hi 陈林平 and Ky Anh Huynh,

We are working with our legal team to make these slides available as soon as possible.

Thanks for your interest!

陈林平 | Staff Engineer
11/11/2019 3:27 CET

Are the slides still not available?

Ky Anh Huynh | System Engineer
6/11/2019 12:25 CET

Hi, Jesus Climent and Akshay Kumar. Are your slides available to download? Thanks a lot.
