Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Making Structured Streaming ready for production: Updates and future directions

Michael Armbrust (Databricks), Tathagata Das (Databricks)
11:50am–12:30pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D
Secondary topics: Streaming
Average rating: 4.29 (7 ratings)

What you'll learn

  • Explore the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases

Description

Apache Spark 2.0 introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers write stream processing applications, freeing them from having to reason about streaming. Structured Streaming allows you to express your streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running the computation incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming, and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.

The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.

Topics include:

  • Design and use of the Kafka Source
  • Support for watermarks and event-time processing
  • Support for more operations and output modes

Michael Armbrust

Databricks

Michael Armbrust is the lead developer of the Spark SQL project at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Tathagata Das

Databricks

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and he currently works at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research on data-center frameworks and networks with Scott Shenker and Ion Stoica.
