Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

The future of streaming in Spark: Structured streaming

Tathagata Das (Databricks)
11:15–11:55 Friday, 3/06/2016
Spark & beyond
Location: Capital Suite 13
Level: Intermediate
Tags: real-time
Average rating: 4.38 (8 ratings)

Prerequisite knowledge

Attendees should have an understanding of Spark basics.

Description

Spark Streaming was one of the first open source projects to unify batch and stream processing in the same platform. Released in 2012, it has become one of the most popular platforms for high-volume stream processing. Over the last few years, we have observed that stream processing systems rarely operate in isolation. More often than not, they have to integrate with nonstreaming data sources, SQL workloads, machine-learning models, interactive queries, etc. These applications are not just streaming applications but more complex “continuous applications” with a wide range of processing needs.

Tathagata Das explains how Spark 2.x introduces the next evolution of Spark Streaming by extending DataFrames and Datasets in Spark to handle streaming data. Streaming Datasets provide a single programming abstraction for batch and streaming data and also bring support for event-time-based processing, out-of-order data, sessionization, and tight integration with nonstreaming data sources and sinks. Tathagata explores these new concepts and demonstrates how they simplify building complex continuous applications.
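
As a rough illustration of the idea of one abstraction for batch and streaming data, the sketch below uses plain Python rather than the Spark API. It counts records per key in tumbling event-time windows, once over the complete input and once over the same input delivered in small chunks with one record arriving out of order. The window size, the record layout, and the function names are assumptions made for this example and are not taken from the talk.

```python
from collections import Counter

WINDOW_SECONDS = 10  # hypothetical tumbling-window size, chosen for this example


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def update_counts(counts, events):
    """Fold (event_time, key) records into per-window, per-key counts."""
    for event_time, key in events:
        counts[(window_start(event_time), key)] += 1
    return counts


# Records are (event_time_seconds, key). The record at t=3 arrives late,
# after a record with event time 12 has already been seen.
events = [(1, "a"), (12, "a"), (3, "a"), (14, "b"), (25, "a")]

# Batch: process the complete input in one pass.
batch_counts = update_counts(Counter(), events)

# Streaming: process the same input incrementally, chunk by chunk,
# keeping the running counts as state between chunks.
stream_counts = Counter()
for micro_batch in (events[:2], events[2:4], events[4:]):
    update_counts(stream_counts, micro_batch)

print(dict(batch_counts))
print(batch_counts == stream_counts)
```

Because each record is assigned to a window by its event time rather than by its arrival order, the late record still lands in the first window, and the incremental run ends with the same counts as the batch run. A real engine would also need a policy for how long to keep window state for late data, which this sketch leaves out.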

Tathagata Das

Databricks

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research on data-center frameworks and networks with Scott Shenker and Ion Stoica.