Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Jordan Hambleton (Cloudera), Guru Medasani (Domino Data Lab)
1:50pm–2:30pm Wednesday, March 7, 2018
Secondary topics:  Data Integration and Data Pipelines
Average rating: 4.25 (4 ratings)

Who is this presentation for?

  • Software engineers, data engineers, data architects, and DevOps engineers

Prerequisite knowledge

  • A basic understanding of Apache Kafka log offset IDs and Apache Spark Streaming's direct approach

What you'll learn

  • Learn how to build fault-tolerant streaming applications that continually read data from Kafka


Streaming data continuously from Kafka allows users to gain insights faster, but when these pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

Jordan and Guru demonstrate how Apache Spark integrates with Apache Kafka for streaming data in a distributed and scalable fashion, covering considerations and approaches for building fault-tolerant streams and detailing a few strategies of offset management to easily recover a stream and prevent data loss.

Topics include:

  • Spark and Kafka architecture
  • Offset management strategies for recoverability (e.g., Spark Streaming checkpoints, storing offsets in HBase, ZooKeeper, and Kafka, and not managing offsets at all)
  • Other production considerations (operational activities, monitoring, other failure scenarios, etc.)
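The core idea behind the offset-management strategies above can be illustrated without a cluster. The sketch below is not the speakers' code: it replaces Kafka and Spark with plain in-memory stand-ins (a Python list as the partition log, a dict as the external offset store, and hypothetical helper names) to show the pattern the talk covers — process a batch, commit the last processed offset only after the work succeeds, and restore that offset when the stream restarts, yielding at-least-once semantics with no data loss.

```python
def load_committed_offset(store, topic_partition):
    """Restore the stream's position; start at offset 0 if nothing was committed."""
    return store.get(topic_partition, 0)

def process_stream(log, store, topic_partition, batch_size=2):
    """Consume records in batches, committing the offset to the external
    store only after each batch is fully processed (at-least-once)."""
    results = []
    offset = load_committed_offset(store, topic_partition)
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        results.extend(record.upper() for record in batch)  # "processing"
        offset += len(batch)
        store[topic_partition] = offset  # commit only after the work is done
    return results

# Simulate a stop-and-restart: the first run processes everything and
# commits offset 5; a later run resumes there, picking up only new data.
log = ["a", "b", "c", "d", "e"]
store = {}
first = process_stream(log, store, "topic-0")    # ["A", "B", "C", "D", "E"]
log.append("f")                                  # new record arrives
second = process_stream(log, store, "topic-0")   # ["F"] — no reprocessing
```

With a real pipeline, the dict would be HBase, ZooKeeper, or Kafka's own committed-offset store, and a crash between processing a batch and committing its offset would cause that batch to be reprocessed rather than lost — the trade-off the talk's recovery strategies address.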

Jordan Hambleton


Jordan Hambleton is a consulting manager and senior architect at Cloudera, where he partners with customers to build and manage scalable enterprise products on the Hadoop stack. Previously, Jordan was a member of technical staff at NetApp, where he designed and implemented the NRT operational data store that continually manages automated support for all of NetApp's customers' production systems.


Guru Medasani

Domino Data Lab

Guru Medasani is a data science architect at Domino Data Lab, where he helps small and large enterprises build efficient machine learning pipelines. Previously, he was a senior solutions architect at Cloudera, where he helped customers build big data platforms and leverage technologies like Apache Hadoop and Apache Spark to solve complex business problems. The business applications he's worked on include collecting, storing, and processing large volumes of machine and sensor data; image processing on Hadoop; machine learning models to predict consumer demand; and tools for advanced analytics on large datasets stored in Hadoop.

Leave a Comment or Question

Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?



Jordan Hambleton
03/08/2018 2:25am PST

Thanks for the note, Mauricio. The slides have been uploaded.

Mauricio Aristizabal | DATA ARCHITECT
03/08/2018 1:30am PST

Guys, could you post your slides here? Thanks!