Kafka clusters should be run with rack awareness, a balanced utilization of all the hardware resources (CPU, disk, and network), and automatic recovery from failures. When running small clusters, the above requirements can be satisfied manually (e.g., by SREs monitoring the broker resource distribution and manually reassigning replicas from hot brokers to less loaded brokers). However, doing so becomes prohibitively expensive with a large Kafka deployment. LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity, and LinkedIn has been exploring ways to do so.
There are a few challenges to be addressed:
Jiangjie Qin explains how LinkedIn addressed the above problems and shares lessons learned from operating Kafka at scale with minimum human intervention. This experience can also be applied to the management of other stateful distributed systems. While there are existing products that help balance resource utilization in a distributed system, most of these are application agnostic and perform the rebalance by migrating the entire application process. While this works well for stateless systems, it typically falls short when it comes to stateful systems (e.g., Kafka) due to the large amount of state associated with the process. LinkedIn’s solution addresses this problem by trying to understand the application and migrating only a partial state—a solution that could be useful in any stateful distributed system.
Jiangjie Qin is a software engineer on the data infrastructure team at LinkedIn, where he works on Apache Kafka. Previously, Jiangjie worked at IBM, where he managed IBM’s zSeries platform for banking clients. He is a Kafka PMC member. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org