Build & maintain complex distributed systems
October 1–2, 2017: Training
October 2–4, 2017: Tutorials & Conference
New York, NY

Health checking: A not-so-trivial task in the distributed containerized world

Alexander Rukletsov (Mesosphere)
11:35am–12:15pm Tuesday, October 3, 2017
Average rating: 2.88 (8 ratings)

Who is this presentation for?

  • Distributed systems engineers and DevOps engineers

Prerequisite knowledge

  • A basic understanding of containers (e.g., Docker) and cluster orchestrators (e.g., Mesos or Kubernetes)

What you'll learn

  • Understand the importance and challenges of health checking in distributed cloud-native apps


People usually think of a health check as a simple sequence: performing a specific action and judging whether the target application is healthy based on the outcome. This becomes trickier when the application consists of multiple containers managed by a cluster orchestrator and monitored by third-party tooling. In this situation, a number of questions arise, including:

  • What entity should interpret the result? Should the reasoning about the health of a task be done locally (less context) or globally (greater overhead)?
  • How often should health status be delivered to balance network overhead against keeping the status up to date?
  • Should health checks be aware of environment-specific intricacies such as namespaces and software-defined networks?
  • How do you keep the overhead imposed by health checks manageable and reasonable?
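To make the first two questions concrete, here is a minimal Python sketch of a health check that is interpreted locally and only reports status changes. It is an illustration under stated assumptions, not the Mesos or Kubernetes implementation: the names `HealthChecker`, `max_consecutive_failures`, and `run_checks` are hypothetical, and the probe is any callable that returns `True` on success.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HealthChecker:
    """Hypothetical local health checker; not an orchestrator API."""
    probe: Callable[[], bool]          # the "specific action", e.g. an HTTP GET or a command
    max_consecutive_failures: int = 3  # failures tolerated before declaring unhealthy
    healthy: Optional[bool] = None     # None until the first verdict is reached
    failures: int = 0                  # current run of consecutive failed probes

    def check_once(self) -> Optional[bool]:
        """Run the probe once; return the new status only if it changed."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a crashing probe counts as a failed check

        if ok:
            self.failures = 0
            verdict = True
        else:
            self.failures += 1
            if self.failures < self.max_consecutive_failures:
                return None  # not enough evidence yet; keep the previous status
            verdict = False

        if verdict == self.healthy:
            return None  # unchanged, so there is nothing to send over the network
        self.healthy = verdict
        return verdict


def run_checks(checker: HealthChecker, rounds: int) -> List[bool]:
    """Run several checks back to back and collect only the status updates."""
    updates = []
    for _ in range(rounds):
        update = checker.check_once()
        if update is not None:
            updates.append(update)
    return updates
```

A real checker would wait for a configured interval between probes and apply a timeout to each one; both are omitted here to keep the sketch short. Reporting only changes reduces network traffic, but the receiver then cannot tell a silent healthy task from a checker that has died, which is one of the trade-offs the questions above point at.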

Alexander Rukletsov discusses the perils of modern health checking and shares lessons learned during the revamp of the Apache Mesos health checks subsystem. Alexander explores challenges and trade-offs and offers an overview of how modern distributed systems, such as AWS, Apache Mesos, and Kubernetes, tackle the problem of health checking, along with alternative solutions.


Alexander Rukletsov


Alex Rukletsov is an Apache committer and Mesos PMC member at Mesosphere. He loves making programs run faster, reducing the cognitive load of code, and creating the right abstractions. In a previous life, Alex segmented medical images and investigated the behavior of human vessels at several German research institutes. His areas of interest include distributed systems, object recognition, and probabilistic and heuristic algorithms.