Cluster orchestrators enable reliable and repeatable application deploys and provide fault tolerance without operator intervention. These orchestrators are themselves complex distributed systems like the applications they manage. The blast radius when a cluster orchestrator fails is huge; it could take down all your applications. Designing resilience into the orchestrator is a unique challenge given its critical operational nature.
Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves looking at the problem as a trade-off between three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that impact the availability of applications managed by the scheduler and lessons learned that apply to building any complex distributed system.
Preetha Appan is a software engineer on the Nomad team at HashiCorp, most recently working on scheduler internals. Previously, she worked on various Consul features toward Consul 1.0 at HashiCorp and was an early engineer at Indeed.com, where she built distributed systems for search and recommendations from the ground up.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org