Spark Streaming is one of the first open source projects that unified batch and stream processing in the same platform. Released in 2012, it has become one of the most popular platforms for high-volume stream processing. Over the last few years, we have observed that stream processing systems rarely operate in isolation. More often than not, they have to integrate with nonstreaming data sources, SQL workloads, machine-learning models, interactive queries, etc. These applications are not just streaming applications but more complex “continuous applications” with a wide range processing needs.
Tathagata Das explains how Spark 2.x develops the next evolution of Spark Streaming by extending DataFrames and Datasets in Spark to handle streaming data. Streaming Datasets provides a single programming abstraction for batch and streaming data and also brings support for event-time-based processing, out-of-order data, sessionization, and tight integration with nonstreaming data sources and sinks. Tathagata explores these new concepts and demonstrates how they simplify building complex continuous applications.
Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.
©2016, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.