Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

The secret sauce behind LinkedIn's self-managing Kafka clusters

Jiangjie Qin (LinkedIn)
11:00am–11:40am Thursday, March 8, 2018
Average rating: 4.00 (3 ratings)

Who is this presentation for?

  • Kafka users, distributed systems developers, and administrators

Prerequisite knowledge

  • A basic understanding of distributed systems concepts (e.g., partitions, replicas, and rack awareness) and of Kafka (useful but not required)

What you'll learn

  • Learn how LinkedIn automates its Kafka operations at scale
  • Discover how to model a workload and balance a stateful distributed system at a fine granularity

Description

Kafka clusters should be run with rack awareness, balanced utilization of all hardware resources (CPU, disk, and network), and automatic recovery from failures. When running small clusters, these requirements can be satisfied manually (e.g., by SREs monitoring the broker resource distribution and manually reassigning replicas from hot brokers to less loaded ones). However, doing so becomes prohibitively expensive with a large Kafka deployment. LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity, and LinkedIn has been exploring ways to achieve it.
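
As a concrete illustration of the manual approach, the sketch below generates a hand-written reassignment plan and applies it with Kafka's stock kafka-reassign-partitions.sh tool. The topic name, partition numbers, and broker IDs are hypothetical, and this is not LinkedIn's tooling:

    import json

    # Hypothetical plan: move two hot partitions of topic "page-views" off an
    # overloaded broker and onto less loaded brokers 4 and 5, keeping the
    # replication factor at 3.
    plan = {
        "version": 1,
        "partitions": [
            {"topic": "page-views", "partition": 0, "replicas": [4, 2, 3]},
            {"topic": "page-views", "partition": 7, "replicas": [5, 2, 3]},
        ],
    }

    with open("reassign.json", "w") as f:
        json.dump(plan, f, indent=2)

    # The plan is then applied and verified with Kafka's own tooling, e.g. on a
    # Kafka deployment of that era:
    #   bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    #       --reassignment-json-file reassign.json --execute
    #   bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    #       --reassignment-json-file reassign.json --verify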

There are a few challenges to be addressed:

  • Monitoring performance at a fine granularity: This is trickier than it sounds. For example, how do we determine the CPU utilization of a single replica when only the broker's overall CPU utilization is available? (See the sketch after this list.)
  • Multi-objective optimization: Because multiple requirements must be met at once, it is complicated to take all of the goals into consideration. Solving such a multi-objective optimization problem can take an extremely long time and render the approach infeasible.
  • Avoiding any impact on availability: It is important to ensure that normal traffic is not affected while a Kafka cluster is being optimized. We may also want to interrupt long-running optimizations if necessary.
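
To make the first challenge concrete, one plausible approach (a minimal sketch, not necessarily LinkedIn's model; the numbers and metric names are made up) is to regress broker-level CPU against the broker's aggregate bytes-in and bytes-out rates, then use the fitted coefficients to attribute a CPU share to each replica from its own throughput:

    import numpy as np

    # Broker-level samples: total CPU (%) and the broker's aggregate
    # bytes-in / bytes-out rates (MB/s) over the same time windows.
    cpu = np.array([35.0, 48.0, 61.0, 72.0])
    bytes_in = np.array([40.0, 60.0, 80.0, 95.0])
    bytes_out = np.array([120.0, 170.0, 230.0, 280.0])

    # Fit cpu ~= a * bytes_in + b * bytes_out + c with ordinary least squares.
    X = np.column_stack([bytes_in, bytes_out, np.ones_like(bytes_in)])
    (a, b, c), *_ = np.linalg.lstsq(X, cpu, rcond=None)

    def estimate_replica_cpu(replica_bytes_in, replica_bytes_out):
        """Estimate the CPU share attributable to a single replica from its
        own throughput, using the broker-level regression coefficients."""
        return a * replica_bytes_in + b * replica_bytes_out

    # For example, a replica handling 5 MB/s in and 15 MB/s out:
    print(round(estimate_replica_cpu(5.0, 15.0), 2))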

Jiangjie Qin explains how LinkedIn addressed these problems and shares lessons learned from operating Kafka at scale with minimal human intervention. This experience can also be applied to the management of other stateful distributed systems. While there are existing products that help balance resource utilization in a distributed system, most of them are application-agnostic and rebalance by migrating the entire application process. This works well for stateless systems, but it typically falls short for stateful systems such as Kafka because of the large amount of state associated with the process. LinkedIn's solution addresses this problem by trying to understand the application and migrating only part of its state, an approach that could be useful in any stateful distributed system.
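
To make the contrast concrete, the sketch below (hypothetical data structures and thresholds, not LinkedIn's actual algorithm) shows the difference: instead of draining every replica from an overloaded broker, which is effectively what migrating the whole process amounts to, it greedily moves only the hottest replicas needed to bring the broker back under its target load, so as few replicas (and as little state) as possible have to move:

    from dataclasses import dataclass

    @dataclass
    class Replica:
        topic: str
        partition: int
        disk_gb: float        # state that would have to be copied if moved
        network_mb_s: float   # this replica's share of the broker's network load

    def pick_replicas_to_move(replicas, broker_net_mb_s, target_net_mb_s):
        """Greedily pick the hottest replicas until the broker's projected
        network load drops below the target, so that as few replicas as
        possible need to leave the broker."""
        excess = broker_net_mb_s - target_net_mb_s
        moved, freed = [], 0.0
        for r in sorted(replicas, key=lambda r: r.network_mb_s, reverse=True):
            if freed >= excess:
                break
            moved.append(r)
            freed += r.network_mb_s
        return moved

    # Whole-process migration would relocate all of these replicas (and all of
    # their on-disk state); partial-state migration moves just enough of them.
    replicas = [
        Replica("page-views", 0, disk_gb=120, network_mb_s=40),
        Replica("clicks", 3, disk_gb=80, network_mb_s=25),
        Replica("logs", 1, disk_gb=200, network_mb_s=5),
    ]
    print(pick_replicas_to_move(replicas, broker_net_mb_s=90, target_net_mb_s=60))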

Jiangjie Qin

LinkedIn

Jiangjie Qin is a software engineer on the data infrastructure team at LinkedIn, where he works on Apache Kafka. Previously, Jiangjie worked at IBM, where he managed IBM’s zSeries platform for banking clients. He is a Kafka PMC member. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.