Running in multiple regions serves your users better through increased availability and lower latency, and it won’t cost as much as you think. Netflix has turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high level and in the nitty-gritty details.
Calamity, heartbreak, and inefficiency drove the company to refine its approach—and its understanding—as it has matured. Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it’s a matter of routine that usually concludes with a brief “all is well” email.
Once you’ve decided to go multiregion, three major questions arise: How many regions do you need? How should you steer users to regions? And how do you actually perform the failover?
Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach. He also offers an overview of the design considerations and system models Netflix used to make those decisions.
Aaron Blohowiak is a senior software engineer on the traffic team at Netflix, where he applies his passion for empiricism and system design to multiregion high-availability architecture and operations. Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 100M users at Netflix. He is the coauthor of Chaos Engineering.
©2018, O'Reilly Media, Inc.