Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Processing fast data with Apache Spark: A tale of two APIs

Gerard Maas (Lightbend)
11:20am–12:00pm Wednesday, 09/12/2018
Streaming systems & real-time applications
Location: 1E 07/08 Level: Intermediate

Who is this presentation for?

  • Data engineers, data architects, and software engineers

Prerequisite knowledge

  • Familiarity with Apache Spark (useful but not required)

What you'll learn

  • Explore Spark Streaming and Structured Streaming, the APIs for streaming data with Apache Spark
  • Learn the key differences between the two APIs and when to choose each (or both)


Fast data architectures provide an answer to the increasing need for the enterprise to process and analyze continuous streams of data, which helps accelerate decision making and enables faster responses to changing characteristics of a company’s market. Apache Spark is a popular framework for data analytics. Its capabilities in the streaming domain are represented by two APIs: the low-level Spark Streaming and the more declarative Structured Streaming, which builds upon the recent advances in Spark SQL query optimization and code generation.

Gerard Maas offers a critical overview of the differences between Spark Streaming and Structured Streaming with regard to key aspects of a streaming application, including API usability, the handling of time, the handling of state, and machine learning capabilities. You’ll learn when to pick one over the other, or how to combine both, to implement resilient streaming pipelines.

Topics include:

  • How to get started (ease of development)
  • How to deal with time, at both the processing and event levels
  • How to deal with state, both local and distributed, and its relation to time
  • How to migrate (functional coding strategies)
  • How to do ML (machine learning capabilities)

Gerard Maas


Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.