Apache Spark 2.0 introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing applications without having to reason streaming. Structured Streaming allows you to express your streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming, and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.
Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.
Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.