Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Cruise Control: Effortless management of Kafka clusters

Adem Efe Gencer (LinkedIn)
2:40pm-3:20pm Wednesday, March 27, 2019
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Kafka users, distributed systems developers, reliability engineers, and researchers interested in scalability, performance, reliability, and fault tolerance issues

Level

Intermediate

Prerequisite knowledge

  • A basic understanding of distributed system concepts (partitioning, replication, rack awareness, etc.)

What you'll learn

  • Learn how Cruise Control achieves automated management of large-scale Kafka clusters to provide reactive and proactive mitigation via anomaly detection with self-healing, dynamic load balancing on heterogeneous clusters, and admin operations for cluster maintenance

Description

Kafka incurs significant management overhead. Growing cluster sizes, the increasing volume and diversity of user traffic, and the age of network and server components further contribute to this overhead. The resulting increase in hardware failures and load imbalance causes frequent service interruptions and a poor user experience. In particular, reactive mitigation becomes insufficient because of the impact on other services that depend on Kafka. Getting near-optimal performance from such an infrastructure service, maintaining its availability in the face of cascading failures, and achieving these objectives with minimal management overhead are critical but nontrivial tasks.

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control. Adem begins by outlining Cruise Control’s approach to monitoring load distribution in clusters, identifying imbalances, and fixing them using replica and leadership movements. He then explains how Cruise Control detects fail-stop broker failures and SLO violations without human intervention and examines a more aggressive scenario, where Cruise Control proactively identifies and mitigates potential service disruptions.

Adem Efe Gencer

LinkedIn

Adem Efe Gencer develops Apache Kafka and the ecosystem around it and supports their operation at LinkedIn. In particular, he works on the design, development, and maintenance of Cruise Control, a system for alleviating the management overhead of large-scale Kafka clusters. He serves as a reviewer for top-tier journals and conferences. He holds a PhD in computer science from Cornell University, where his research focused on improving the scalability of blockchain technologies. The protocols introduced in his research were adopted by Waves Platform, Aeternity, Cypherium, Enecuum, Ergo Platform, and Legalthings and are actively being developed into other systems. His papers have been cited over 500 times. He received a best student paper award at the Middleware Conference.