SRE classroom: How to design a reliable application in three hours (sponsored by Google)

Jesus Climent (Google), Akshay Kumar (Google)

13:30–17:00 Tuesday, 5 November 2019

Location: Hall A6

Expo Plus Sessions, Sponsored, Systems Engineering and Architecture

Who is this presentation for?

Site reliability engineers, engineering managers, and technical program managers

Level

Beginner

Description

Outline:

Goals and expectations

We have a problem. Let’s solve it with software
The initial problem statement
The service-level objective (SLO)
Terminology and concepts
Hardware (memory, processor)
Software (libraries, invariants)
Distributed
Hardware (data center, network)
Software (algorithms, failures)
What is distributed consensus? Why is it important?
Hands-on exercise: Identify the components necessary to build a working system in a single location and produce a sketch of this working system (UML not required)

The solution has limitations: Let’s improve it

We have identified single points of failure…because things failed. The system failed. And we lost users.
Let’s replicate this thing.
What parts are useful to duplicate? Replicate? How do we arrange this so that we make the computers do all the work?
How do we know that these systems are doing what we expect?
We have performance bottlenecks
How do we identify bottlenecks?
Conversely, how do we know that we have removed these bottlenecks?
How can we apply these concepts to a real piece of software?
What limitations does this introduce?
Hands-on exercise: Identify which components can usefully run in multiple locations; evaluate how to write an SLO (and how to apply it) and produce a system that runs in multiple data centers
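
As a rough illustration of the SLO and replication questions above, here is a minimal sketch. The 99.9% target, the 30-day window, and the assumption that locations fail independently are illustrative choices, not values from the workshop, and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for a given SLO.

    `slo` is a fraction, e.g. 0.999 for "three nines" (an assumed target).
    """
    return (1.0 - slo) * window_days * 24 * 60


def combined_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` copies is enough to serve.

    Assumes failures are independent, which real data centers only
    approximate (shared networks, shared rollouts, and so on).
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Allowed downtime per 30 days under the assumed 99.9% SLO.
    print(round(error_budget_minutes(0.999), 1))
    # Two independent locations that are each 99% available.
    print(round(combined_availability(0.99, 2), 6))
```

The independence assumption is the weak point of the second function: correlated failures make the real figure lower, which is one of the limitations that running in multiple locations introduces.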

Discussion and conclusions

Present an example solution
Discuss commonly encountered limitations
What key points have we learned?
How does it apply beyond this workshop?
Assessing and evaluating third-party (i.e., cloud) systems and integrating them into your design

Hands-on exercises

For each exercise, you’ll work in small groups to apply the concepts to the problem. As Jesus Climent discusses additional aspects of distributed-systems design, you’ll apply these concepts to your in-progress solutions.

This tutorial is sponsored by Google.

Prerequisite knowledge

Suggested readings:

Susan J. Fowler's Microservices in Production

SRE Book

Service Level Objectives
Load Balancing at the Frontend
Load Balancing in the Datacenter
Managing Critical State: Distributed Consensus for Reliability
Data Integrity: What You Read Is What You Wrote

The Warehouse Datacenter

Distributed systems in production environments

The Google File System
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Borg

CAP Twelve Years Later

Skills and Tools

Familiarity with order-of-magnitude comparisons

What you'll learn

Learn how to evaluate distributed systems using quantitative analysis, identify single points of failure in a large software system, incrementally improve a system, and make the resource estimates needed to create a bill of materials
See how an SLO fits into system design
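
A back-of-the-envelope resource estimate of the kind that feeds a bill of materials might look like the sketch below. Every number (peak load, per-server capacity, item size, replica count) is a made-up assumption for illustration, and the function names are hypothetical.

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float, spare: int = 1) -> int:
    """Servers required to carry peak load, plus `spare` extras for failures."""
    return math.ceil(peak_qps / qps_per_server) + spare


def storage_tb(items: int, bytes_per_item: int, copies: int = 3) -> float:
    """Total storage in terabytes (10**12 bytes) across all copies."""
    return items * bytes_per_item * copies / 1e12


if __name__ == "__main__":
    # Assumed: 10,000 queries/s at peak, 800 queries/s per server, one spare.
    print(servers_needed(10_000, 800))
    # Assumed: one billion items of 2 kB each, stored three times.
    print(storage_tb(1_000_000_000, 2_000))
```

Estimates like these only need to be right to within an order of magnitude; the point is to see whether a design needs ten machines or ten thousand.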

Jesus Climent

Google

Jesus Climent is a senior site reliability engineer on the CRE team at Google, where he helps companies meet their reliability requirements. Previously, he spent eight years at Nokia as a system administrator and architecture engineer.

Akshay Kumar

Google

Akshay is a senior SRE on Cloud Bigtable, Google’s petabyte-scale NoSQL database. Before this, he worked as an engineer on Google Search and as an options trader at JPMorgan. He enjoys learning new things and scaling himself sub-linearly.
