Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

The secret sauce behind LinkedIn's self-managing Kafka clusters

Jiangjie Qin (LinkedIn)
11:00am11:40am Thursday, March 8, 2018
Average rating: ****.
(4.00, 3 ratings)

Who is this presentation for?

  • Kafka users and distributed system developers and administrators

Prerequisite knowledge

  • A basic understanding of distributed systems (e.g., partition, replica, rack awareness, etc.) and Kafka (useful but not required)

What you'll learn

  • Learn how LinkedIn automates its Kafka operation at scale
  • Discover how to model a workload and balance a stateful distributed system at a fine granularity


Kafka clusters should be run with rack awareness, a balanced utilization of all the hardware resources (CPU, disk, and network), and automatic recovery from failures. When running small clusters, the above requirements can be satisfied manually (e.g., by SREs monitoring the broker resource distribution and manually reassigning replicas from hot brokers to less loaded brokers). However, doing so becomes prohibitively expensive with a large Kafka deployment. LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity, and LinkedIn has been exploring ways to do so.

There are a few challenges to be addressed:

  • Monitoring the performance at a fine granularity: This is trickier than it sounds. For example, how can we decide the CPU utilization of a replica while only CPU utilization of a broker is available?
  • Multipurpose optimization: Because there are multiple requirements to be met, it is complicated to take all the goals into consideration. Solving such multipurpose optimization can take an extremely long time and render the approach infeasible.
  • Avoiding any impact on availability: It is important to ensure that normal traffic is not affected while optimizing a Kafka cluster. We may also wish to interrupt some long running optimizations if necessary.

Jiangjie Qin explains how LinkedIn addressed the above problems and shares lessons learned from operating Kafka at scale with minimum human intervention. This experience can also be applied to the management of other stateful distributed systems. While there are existing products that help balance resource utilization in a distributed system, most of these are application agnostic and perform the rebalance by migrating the entire application process. While this works well for stateless systems, it typically falls short when it comes to stateful systems (e.g., Kafka) due to the large amount of state associated with the process. LinkedIn’s solution addresses this problem by trying to understand the application and migrating only a partial stateā€”a solution that could be useful in any stateful distributed system.

Photo of Jiangjie Qin

Jiangjie Qin


Jiangjie Qin is a software engineer on the data infrastructure team at LinkedIn, where he works on Apache Kafka. Previously, Jiangjie worked at IBM, where he managed IBM’s zSeries platform for banking clients. He is a Kafka PMC member. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.