We often think of reliability in terms of improving individual components like machines, servers, and operational processes. But in a distributed system, something is always failing, and the more parts you add, the more ways there are for the system to fail. While tools and optimizations can help, reaching the highest levels of reliability requires rethinking the basic design of how components behave and interact.
Astrid Atkinson discusses techniques for building systems that are resilient by design.
Astrid Atkinson is director of software engineering at Google, where she leads development frameworks. During her 10+ years at Google, Astrid has built infrastructure, managed a variety of engineering teams, and spent more than five years on call for Google.com. She has led teams across the infrastructure map, from the team responsible for building and running Google’s web-serving layer to App Engine and cloud systems to core search.
©2018, O'Reilly Media, Inc.