Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Operating Kafka at petabyte scale

2:40pm–3:20pm Wednesday, March 15, 2017
Secondary topics:  Media, Streaming

Who is this presentation for?

  • Site reliability engineers, software engineers, and DevOps engineers

Prerequisite knowledge

  • Experience operating a Kafka cluster (useful but not required)
  • Basic familiarity with capacity planning and risk management

What you'll learn

  • Gain a renewed appreciation for the robust design of Kafka
  • Learn how to cope gracefully with the possibility of inaccurate or obsolete deployment parameters


Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He’ll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage.

Michael Edwards shares experiences and lessons learned operating Kafka at this scale.

Topics include:

  • Cluster-specific goals and outcomes
  • Monitoring, alerts, and outlier detection
  • Planned and unplanned maintenance
  • Provisioning and capacity planning
  • Fun failures we have seen
  • Crises averted by Kafka’s implementation details

Michael Edwards


Michael Edwards’ idea of a full-stack developer extends from interaction design in big data analytics systems down to clock/data recovery in backscatter-modulated RF protocols. He’s all about scale-up, cost-down, with additional areas of focus in authentication/access control and easy-to-integrate data visualization components.