Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

So you think you can stream: Use cases and design patterns for Spark Streaming

Vida Ha (Databricks), Prakash Chockalingam (Databricks)
12:05–12:45 Thursday, 2/06/2016
Spark & beyond
Location: Capital Suite 13
Level: Intermediate
Average rating: 3.59 (17 ratings)

Prerequisite knowledge

Attendees should be familiar with Spark.


So you’ve successfully tackled big data. Now let Vida Ha and Prakash Chockalingam help you take it real time and conquer fast data. Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. Vida and Prakash walk you through the most common use cases for Spark Streaming, the design patterns that emerge from these use cases, tips for avoiding common pitfalls while implementing these patterns, and performance optimization techniques for getting maximum throughput out of your system.

Topics include:

  • Streaming data ingestion and ETL: building a data highway to ingest real-time data into warehouses, search engines, or data lakes
  • Monitoring and dashboarding
  • Anomaly/fraud detection with online learning: making predictions on streams and keeping the model up to date as new data is observed
  • Sessionization: identifying sessions based on user behavior from streams
  • Associative time-based window aggregations: how and when to use window functions efficiently to do associative aggregations and maintain running statistics from your data
  • Global aggregations with state management: maintaining the most current value of a statistic for all time with a global state
  • Joining streams efficiently with static and dynamic datasets
  • Using SQL operations on streams: how to use Spark SQL on DStreams efficiently
  • How to scale out efficiently to achieve high throughput
  • Better state management with state pruning
  • Fine-tuning the checkpoint interval for optimal performance
  • Efficient ways of writing to data sinks
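
As a rough illustration of the window-aggregation and state-pruning topics above, the sketch below keeps per-key counts over a sliding window of micro-batches in plain Python. It does not use the Spark API and is not taken from the talk. The class name `SlidingWindowCounter` and its `window_batches` parameter are hypothetical, chosen only for this example. The idea is that an aggregation with an inverse (here, addition and subtraction) can be updated incrementally: add the batch entering the window and subtract the batch leaving it, rather than recomputing the whole window. Spark Streaming's `reduceByKeyAndWindow` accepts an optional inverse reduce function for a similar purpose.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keeps per-key counts over the last `window_batches` micro-batches.

    Hypothetical helper for illustration only; not part of Spark.
    """

    def __init__(self, window_batches):
        if window_batches < 1:
            raise ValueError("window_batches must be at least 1")
        self.window_batches = window_batches
        self.batches = deque()   # per-batch counts currently inside the window
        self.totals = Counter()  # running totals across the whole window

    def add_batch(self, events):
        """Fold one micro-batch of keys into the window; return current totals."""
        batch_counts = Counter(events)
        self.batches.append(batch_counts)
        self.totals.update(batch_counts)  # add the batch entering the window

        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            self.totals.subtract(expired)  # inverse step for the batch leaving
            # Prune keys whose count dropped to zero so state stays bounded.
            for key in expired:
                if self.totals[key] == 0:
                    del self.totals[key]

        return dict(self.totals)


# Usage: a window spanning two micro-batches.
counter = SlidingWindowCounter(window_batches=2)
first = counter.add_batch(["a", "b", "a"])
second = counter.add_batch(["b"])
third = counter.add_batch(["c"])  # the first batch has now left the window
```

After the third batch, the key `"a"` no longer appears in the totals because its only occurrences were in the expired batch. Removing such keys is a simple form of the state pruning mentioned in the topic list.
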

Vida Ha


Vida Ha is currently a solutions engineer at Databricks. Previously, she worked on scaling Square’s reporting analytics system. Vida first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.


Prakash Chockalingam


Prakash Chockalingam is currently a solutions architect at Databricks, where he focuses on helping customers build their big data infrastructure, drawing on his decade-long experience with large-scale distributed systems and machine-learning infrastructure at companies including Netflix and Yahoo. Prior to joining Databricks, Prakash was with Netflix, designing and building the recommendation infrastructure that serves millions of recommendations to users every day. His interests broadly include distributed systems and machine learning. Early in his career, he coauthored several publications on machine learning and computer vision research.

Comments on this page are now closed.


Deepa Rao
6/04/2016 20:16 BST

Can you please share your slides? Thanks!