Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Making Structured Streaming ready for production: Updates and future directions

Michael Armbrust (Databricks), Tathagata Das (Databricks)
11:50am–12:30pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D
Secondary topics: Streaming
Average rating: 4.29 (7 ratings)

What you'll learn

  • Explore the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases


Apache Spark 2.0 introduced Structured Streaming, a new stream processing engine built on Spark SQL that lets developers write stream processing applications without having to reason about streaming. Structured Streaming allows you to express your streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running the computation incrementally and continuously updating the final result as streaming data continues to arrive. It unifies batch, streaming, and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.

The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as a streaming source and sink. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.

Topics include:

  • Design and use of the Kafka Source
  • Support for watermarks and event-time processing
  • Support for more operations and output modes

Michael Armbrust


Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Tathagata Das


Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and currently works at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research on data-center frameworks and networks with Scott Shenker and Ion Stoica.