Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Using machine learning to simplify Kafka operations

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)
11:00am–11:40am Wednesday, March 7, 2018
Secondary topics: Graphs and Time-series
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Big data operations team members and developers of real-time applications

Prerequisite knowledge

  • Familiarity with the big data stack

What you'll learn

  • Learn how to get the best performance, predictability, and reliability for Kafka-based applications
  • Explore a case study of how to apply recent advances in machine learning and AI to solve a real problem


Apache Kafka, an open source stream processing platform that provides unified, high-throughput, low-latency handling of real-time data feeds, is widely integrated into enterprise infrastructures for a variety of use cases, such as ingesting data into clusters, serving live data feeds for real-time processing, and replicating input data across clusters and data centers to provide 24×7 application uptime. While Kafka’s interface is delightfully simple, operating Kafka clusters is a challenge, and misbehaving clusters are hard to troubleshoot. When a real-time application lags, the root cause may be an application problem (e.g., poor data partitioning or load imbalance) or a Kafka problem (e.g., resource exhaustion or suboptimal configuration).
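
As a rough illustration of the application-side causes mentioned above, the sketch below computes per-partition consumer lag (log end offset minus committed offset) and a simple skew ratio that hints at poor partitioning or load imbalance. The offsets, the function names `partition_lag` and `skew_ratio`, and the max-over-mean measure are assumptions made for illustration; they are not taken from the talk.

```python
from statistics import mean


def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset.

    Assumption: a partition with no committed offset is treated as fully lagging.
    """
    return {
        p: max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    }


def skew_ratio(values):
    """Max over mean. Values near 1.0 suggest balanced load; larger values suggest skew."""
    values = list(values)
    avg = mean(values)
    return max(values) / avg if avg > 0 else 1.0


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1000, 1: 1000, 2: 5000}
committed = {0: 990, 1: 980, 2: 2000}

lag = partition_lag(end_offsets, committed)
print(lag)                                  # {0: 10, 1: 20, 2: 3000}
print(round(skew_ratio(lag.values()), 2))   # 2.97
```

Here one partition holds almost all of the lag, which points toward a partitioning or consumer-side imbalance rather than a cluster-wide resource problem.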

In short, getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data available from Kafka. This data includes metrics from Kafka brokers, producers, consumers, and the underlying infrastructure as well as logs from various components of the Kafka ecosystem. Shivnath and Dhruv discuss how to automatically identify the root cause of a number of Kafka-based application bottlenecks, slowdowns, and failures using improved algorithms for anomaly detection, correlation, and forecasting. They also describe how the models they trained on data from production Kafka clusters have enabled effective capacity planning, leading to lower costs and higher reliability than the guesstimates used previously.
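
To make the idea of anomaly detection on Kafka monitoring data concrete, here is a minimal sketch that flags points in a consumer-lag time series using a trailing-window z-score. This is a generic textbook technique, not the speakers' algorithm; the function name, window size, threshold, and sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against; skip.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical consumer-lag samples (messages behind), one per minute.
lag_samples = [100, 102, 98, 101, 99, 100, 103, 500, 101, 99]
print(rolling_zscore_anomalies(lag_samples))  # [7]
```

A production system would likely need something more robust (seasonality, multiple correlated metrics, log signals), which is the kind of improvement the abstract alludes to.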

Shivnath Babu

Duke University | Unravel Data Systems

Shivnath Babu is an associate professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the chief scientist at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Dhruv Goel

Microsoft

Dhruv Goel is a PM for Microsoft Azure Data Services, where he focuses on open source analytics. Before joining Microsoft, Dhruv earned his MBA from Wharton Business School, graduating as a Palmer Scholar. He holds a BS in computer science and previously worked as a software engineer at Amazon.