Who guards the guardians? Designing for resilience in cluster orchestrators

Preetha Appan (HashiCorp)
Software engineers interested in building fault-tolerant distributed systems

Explore failure modes ranging from network failures to entire server failures in Nomad


Cluster orchestrators enable reliable and repeatable application deploys and provide fault tolerance without operator intervention. Like the applications they manage, these orchestrators are themselves complex distributed systems. The blast radius of an orchestrator failure is huge: it could take down all of your applications. Because the orchestrator is so operationally critical, designing resilience into it is a unique challenge.

Preetha Appan outlines various failure modes, ranging from network failures to entire server failures, in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience against these failures means treating the problem as a trade-off among three system properties: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that affect the availability of applications managed by the scheduler, as well as lessons learned that apply to building any complex distributed system.

Preetha Appan


Preetha Appan is a software engineer on the Nomad team at HashiCorp, most recently working on scheduler internals. Previously, she worked on various Consul features toward Consul 1.0 at HashiCorp and was an early engineer at, where she built distributed systems for search and recommendations from the ground up.