Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Processing fast data with Apache Spark: A tale of two APIs

Gerard Maas (Lightbend)
11:15–11:55 Wednesday, 23 May 2018
Data engineering and architecture, Streaming systems and real-time applications
Location: Capital Suite 8/9
Level: Intermediate
Average rating: 4.00 (13 ratings)

Who is this presentation for?

  • Software engineers, data engineers, and enterprise architects

Prerequisite knowledge

  • Familiarity with Apache Spark and streaming applications

What you'll learn

  • Explore the capabilities of Spark's APIs for streaming and their key differences
  • Learn how to make the right choice for an application and how to architect and develop streaming pipelines that use one or both APIs to fulfill their requirements

Description

Fast data architectures provide an answer to enterprises’ increasing need to process and analyze continuous streams of data, which helps accelerate decision making and enables faster responses to changing characteristics of their market. Apache Spark is a popular framework for data analytics. Its capabilities in the streaming domain are represented by two APIs: the low-level Spark Streaming and the more declarative Structured Streaming, which builds upon the recent advances in Spark SQL query optimization and code generation.

Gerard Maas offers a critical overview of the differences between these APIs, from the developer experience to the handling of time and state to machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines.

Topics include:

  • How to get started (ease of development)
  • How to deal with time (both at the processing and event level)
  • How to deal with state (locally, distributed, and its relation to time)
  • How to migrate (functional coding strategies)
  • How to do ML (machine learning capabilities)

Gerard Maas

Lightbend

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.