Build Systems that Drive Business
Sep 30–Oct 1, 2018: Training
Oct 1–3, 2018: Tutorials & Conference
New York, NY

Availability, latency, and cost: Withstanding regional outages

Aaron Blohowiak (Netflix)
11:35am–12:15pm Wednesday, October 3, 2018
Systems Engineering and Architecture
Location: Murray Hill
Level: Intermediate
Secondary topics: Systems Architecture & Infrastructure

Prerequisite knowledge

  • Familiarity with cloud deployment, scaling, and DNS concepts

What you'll learn

  • Learn how Netflix operates in multiple regions at scale
  • Explore the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach

Description

Running in multiple regions serves your users better through increased availability and lower latency, and it won’t cost as much as you think. Netflix has turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics, both at a high level and in the nitty-gritty details.

Calamity, heartbreak, and inefficiency drove the company to refine its approach—and its understanding—as it has matured. Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it’s a matter of routine that usually concludes with a brief “all is well” email.

Once you’ve decided to go multiregion, three major questions arise: How many regions do you need? How should you steer users to regions? And how do you actually perform the failover?
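The first question, how many regions you need, lends itself to a quick back-of-the-envelope calculation. The sketch below is not Netflix's model. It is a minimal illustration that assumes traffic is split evenly across regions and that a failed region's traffic is redistributed evenly among the survivors. The function name `failover_overhead` and its parameters are hypothetical.

```python
from fractions import Fraction


def failover_overhead(regions: int, failed: int = 1) -> Fraction:
    """Extra capacity each surviving region needs, relative to its normal load.

    Assumptions (illustrative only): traffic is split evenly across
    `regions`, and the load of the `failed` regions is spread evenly
    over the remaining ones.
    """
    if regions < 2 or not 0 < failed < regions:
        raise ValueError("need at least 2 regions and 1 <= failed < regions")
    # Normal per-region load is 1/regions of the total; after the failure it
    # becomes 1/(regions - failed). The relative increase simplifies to
    # failed / (regions - failed).
    return Fraction(failed, regions - failed)


if __name__ == "__main__":
    for n in (2, 3, 4):
        extra = failover_overhead(n)
        print(f"{n} regions: {float(extra):.0%} headroom per region")
```

Under these assumptions, surviving one regional outage requires 100% headroom with two regions, 50% with three, and about 33% with four, which is one way adding regions can lower the cost of resiliency rather than raise it.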

Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach. He also offers an overview of the design considerations and system models Netflix used to make those decisions.

Aaron Blohowiak

Netflix

Aaron Blohowiak is a senior software engineer on the traffic team at Netflix, where he applies his passion for empiricism and system design to multiregion high-availability architecture and operations. Aaron has been building, breaking, and fixing systems for over a decade, at companies ranging from tiny startups to Netflix, which serves over 100M users. He is the coauthor of Chaos Engineering.