Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
1:50pm–2:30pm Wednesday, March 7, 2018
Secondary topics:  Data Integration and Data Pipelines
Average rating: 4.25 out of 5 (4 ratings)

Who is this presentation for?

  • Software engineers, data engineers, data architects, and DevOps engineers

Prerequisite knowledge

  • A basic understanding of Apache Kafka log offset IDs and Apache Spark Streaming's direct approach

What you'll learn

  • Learn how to build fault-tolerant streaming applications that continually read data from Kafka

Description

Streaming data continuously from Kafka allows users to gain insights faster, but when these pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

Jordan and Guru demonstrate how Apache Spark integrates with Apache Kafka for streaming data in a distributed and scalable fashion, covering considerations and approaches for building fault-tolerant streams and detailing a few strategies of offset management to easily recover a stream and prevent data loss.
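The core idea behind the recovery strategies the speakers cover can be sketched without any framework at all: process a batch, then durably commit the offset range it covered, so a restart resumes exactly where the stream left off. The sketch below is illustrative only, in plain Python; `OffsetStore`, `consume`, and `run_stream` are hypothetical names standing in for Kafka partitions, the direct-stream batches, and durable storage such as HBase or ZooKeeper, not real Spark or Kafka APIs.

```python
class OffsetStore:
    """Stands in for durable offset storage (HBase, ZooKeeper, Kafka, ...)."""
    def __init__(self):
        self._offsets = {}

    def load(self, topic_partition):
        # Resume from offset 0 if nothing was ever committed.
        return self._offsets.get(topic_partition, 0)

    def commit(self, topic_partition, offset):
        self._offsets[topic_partition] = offset


def consume(log, start, batch_size=3):
    """Yield (batch, next_offset) pairs, like direct-stream micro-batches."""
    pos = start
    while pos < len(log):
        batch = log[pos:pos + batch_size]
        pos += len(batch)
        yield batch, pos


def run_stream(log, store, results, fail_after=None):
    """Process batches, committing each offset range after processing."""
    start = store.load("events-0")
    for n, (batch, next_offset) in enumerate(consume(log, start)):
        results.extend(m.upper() for m in batch)  # the "processing" step
        store.commit("events-0", next_offset)     # offset is now durable
        if fail_after is not None and n + 1 >= fail_after:
            return                                # simulated crash


log = [f"msg{i}" for i in range(7)]
store = OffsetStore()
results = []
run_stream(log, store, results, fail_after=1)  # crash after the first batch
run_stream(log, store, results)                # restart: resumes at offset 3
```

Because the stored offset survives the simulated crash, the restart neither loses nor reprocesses messages; swapping the dictionary for a real external store is what the talk's strategies differ on.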

Topics include:

  • Spark and Kafka architecture
  • Offset management strategies for recoverability (e.g., Spark Streaming checkpoints, storing offsets in HBase, ZooKeeper, and Kafka, and not managing offsets at all)
  • Other production considerations (operational activities, monitoring, other failure scenarios, etc.)
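One way to see why the choice among these strategies matters: commit offsets *before* writing output and a crash in between loses data; commit *after* and the same batch may be replayed, so the sink should be idempotent. The sketch below (plain Python, with illustrative names; not real Spark, Kafka, or HBase APIs) keys the sink by offset so replays overwrite rather than duplicate:

```python
def process_batch(sink, batch, base_offset):
    """Write each record keyed by its offset; return the next offset to commit."""
    for i, record in enumerate(batch):
        # Upsert keyed by offset: a replayed batch overwrites, never duplicates.
        sink[base_offset + i] = record.upper()
    return base_offset + len(batch)


log = [f"evt{i}" for i in range(5)]
sink = {}        # stands in for an idempotent store, e.g. one HBase row per offset
committed = 0    # last durably committed offset

# Batch 1 is processed, but we "crash" before committing its offset range.
process_batch(sink, log[0:3], 0)

# On restart we resume from `committed` (still 0) and replay batch 1 harmlessly.
committed = process_batch(sink, log[0:3], 0)
committed = process_batch(sink, log[3:5], 3)
```

This is at-least-once delivery made effectively exactly-once by the sink's idempotence, which is the property the offset-management strategies above are trying to preserve across failures.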

Jordan Hambleton

Cloudera

Jordan Hambleton is a Consulting Manager and Senior Architect at Cloudera, where he partners with customers to build and manage scalable enterprise products on the Hadoop stack. Previously, Jordan was a member of technical staff at NetApp, where he designed and implemented the near-real-time (NRT) operational data store that continually manages automated support for all of NetApp's customers' production systems.


GuruDharmateja Medasani

Domino Data Lab

Guru Medasani is a Data Science Architect at Domino Data Lab, where he helps small and large enterprises build efficient machine learning pipelines. Previously, he was a senior solutions architect at Cloudera, where he helped customers build big data platforms and leverage technologies like Apache Hadoop and Apache Spark to solve complex business problems. The business applications he's worked on include systems for collecting, storing, and processing huge amounts of machine and sensor data, image processing applications on Hadoop, machine learning models to predict consumer demand, and tools for advanced analytics on large volumes of data stored in Hadoop.

Comments on this page are now closed.

Comments

Jordan Hambleton | CONSULTING MANAGER
03/08/2018 2:25am PST

Thanks for the note, Mauricio. The slides have been uploaded.

Mauricio Aristizabal | DATA ARCHITECT
03/08/2018 1:30am PST

Guys, could you post your slides here? tx