Spark Streaming Case Studies

Paco Nathan
Hadoop & Beyond
Location: 212
Average rating: 4.00 (10 ratings)
Slides: 1-PDF

Apache Spark supports streaming use cases, such as real-time analytics on log files, through a model called discretized streams (D-Streams), which runs “micro-batch” computations over small time intervals.
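To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use Spark's API; the names `micro_batches`, `count_levels`, and `log_events`, along with the sample log lines, are hypothetical and chosen only for illustration. It groups timestamped log records into fixed intervals and runs an ordinary batch computation on each interval.

```python
from collections import Counter, defaultdict

def micro_batches(events, interval):
    """Group (timestamp, record) pairs into fixed windows of `interval` seconds.

    Illustrative only: a real engine receives events incrementally, while this
    sketch discretizes an already-collected list.
    """
    buckets = defaultdict(list)
    for ts, record in events:
        buckets[int(ts // interval)].append(record)
    # Emit the non-empty batches in time order, keyed by window start time.
    for idx in sorted(buckets):
        yield idx * interval, buckets[idx]

def count_levels(batch):
    """Batch computation: count log lines by their first token (the log level)."""
    return Counter(line.split()[0] for line in batch)

# Hypothetical sample data: (seconds since start, log line)
log_events = [
    (0.2, "INFO start"),
    (0.7, "ERROR disk full"),
    (1.1, "INFO retry"),
    (1.9, "INFO ok"),
    (2.5, "ERROR timeout"),
]

results = [(start, count_levels(batch))
           for start, batch in micro_batches(log_events, interval=1)]

for start, counts in results:
    print(start, dict(counts))
```

Each one-second window is processed as a small, self-contained batch, which is the essence of the discretized-stream model described above.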

In this talk we will compare several published case studies for production deployments of Spark Streaming, based on interviews with the development teams. We will also compare and contrast other approaches to streaming at scale, such as Google’s MillWheel case study, Storm at Twitter, and S4 at Yahoo! and Nokia.

One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc. This talk will present an open source example of integrating Spark Streaming, Spark SQL, and Tachyon within a single app for real-time machine learning updates.
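The "same business logic" point can be sketched in plain Python, again without Spark's API. The function `error_rate` and the sample records are hypothetical. The idea is that one function is written once and then applied both to a complete historical dataset and to each incoming micro-batch.

```python
def error_rate(records):
    """Business logic written once: fraction of records that start with ERROR."""
    records = list(records)
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.startswith("ERROR"))
    return errors / len(records)

# Batch use: apply the logic to a full (hypothetical) historical dataset.
historical = ["INFO a", "ERROR b", "INFO c", "INFO d"]
batch_result = error_rate(historical)

# Streaming use: apply the very same function to each micro-batch as it arrives.
incoming = [["INFO e", "ERROR f"], ["ERROR g"], []]
stream_results = [error_rate(batch) for batch in incoming]

print(batch_result, stream_results)
```

Because the logic lives in one place, a change to `error_rate` applies to both the batch and the streaming path without duplicating code.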


Paco Nathan

O’Reilly author (Enterprise Data Workflows with Cascading and the new Just Enough Math) and a “player/coach” who has led innovative data teams building large-scale apps. OSS evangelist for Apache Spark (Databricks), workshop instructor (Global Data Geeks), advisor to Zettacap, Amplify Partners, The Data Guild. Expert in machine learning, cluster computing, and enterprise use cases for Big Data. Interests: Spark, Mesos, PMML, Open Data, Cascalog, Scalding, Python for analytics, NLP.



Paco Nathan
20-11-2014 12:15 CET

One update about this talk: instead of a Streaming + SQL + Tachyon demo, it will show the new support for Python in Spark Streaming that becomes available in the upcoming Spark 1.2 release, along with more details about large production use cases.