Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

IoT with Spark Streaming: Practical lessons from real-world use cases

Hari Shreedharan (Cloudera), Anand Iyer (Cloudera)
4:35pm–5:15pm Wednesday, 09/30/2015
IoT & Real-time
Location: 3D 02/11 Level: Intermediate
Average rating: 3.17 (6 ratings)

Over the past year, Spark Streaming has emerged as the leading platform to implement IoT and similar real-time use cases. There are successful implementations across a diverse spectrum of industries, from consumer internet and mobile to healthcare to traditional manufacturing.

We will start with a brief introduction to Spark Streaming’s micro-batch architecture for real-time stream processing. However, the primary focus of the talk will be on end-to-end architectures and use cases. We will give a walkthrough and live demo of an example use case that involves processing and alerting on time series data (such as sensor data). The example runs end to end: ingesting the time series data streams with Kafka, processing them in Spark Streaming to identify egregious conditions, and sending alerts via Kafka events.
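The micro-batch alerting flow described above can be illustrated with a minimal plain-Python sketch. It does not use Kafka or the Spark Streaming API; the `Reading` type, the function names, the one-second batch interval, and the threshold of 90.0 are all assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Reading(NamedTuple):
    """Hypothetical sensor reading; field names are illustrative only."""
    sensor_id: str
    timestamp: float  # seconds since the start of the stream
    value: float


def micro_batches(readings: Iterable[Reading], batch_seconds: float) -> List[List[Reading]]:
    """Group readings into consecutive fixed-length batches by timestamp."""
    buckets: Dict[int, List[Reading]] = defaultdict(list)
    for r in readings:
        buckets[int(r.timestamp // batch_seconds)].append(r)
    # Return non-empty batches in time order.
    return [buckets[k] for k in sorted(buckets)]


def alerts_for_batch(batch: List[Reading], threshold: float) -> List[dict]:
    """Return one alert event per reading whose value exceeds the threshold."""
    return [
        {"sensor_id": r.sensor_id, "timestamp": r.timestamp, "value": r.value}
        for r in batch
        if r.value > threshold
    ]


# Stand-in for an ingested stream (in the talk's architecture this would come from Kafka).
stream = [
    Reading("s1", 0.2, 20.0),
    Reading("s2", 0.7, 95.0),
    Reading("s1", 1.1, 21.0),
    Reading("s2", 2.4, 99.5),
]

# Process each batch independently and collect the alert events.
all_alerts = [
    alert
    for batch in micro_batches(stream, batch_seconds=1.0)
    for alert in alerts_for_batch(batch, threshold=90.0)
]
```

In a real deployment the alert events would be published to an output topic rather than collected in a list; the sketch only shows the per-batch shape of the computation.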

Alerting and visualization often go together. After all, when something goes wrong, the investigation entails visualizing relevant events and metrics. We will extend our architecture by showing how the time series output of Spark Streaming can be written to HBase or OpenTSDB, so that it can be served to a front end for visualization.
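As one hedged illustration of preparing time series output for a store such as OpenTSDB, the sketch below formats a data point as a line in OpenTSDB's text `put` format (metric, timestamp, value, then at least one tag). The helper name, the metric name, and the tag names are assumptions for illustration; how the talk's demo actually writes its output is not stated in the abstract.

```python
from typing import Mapping


def opentsdb_put_line(metric: str, timestamp: int, value: float, tags: Mapping[str, str]) -> str:
    """Format one data point as an OpenTSDB text-protocol `put` line.

    OpenTSDB requires at least one tag per data point, so an empty
    tag mapping is rejected here.
    """
    if not tags:
        raise ValueError("OpenTSDB data points need at least one tag")
    # Sort tags so the output is deterministic.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}"


# Hypothetical metric and tag names.
line = opentsdb_put_line("sensor.temperature", 1443645300, 21.5, {"sensor": "s1"})
```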

In addition to the above use case, we will highlight some of the high-level operators and libraries available in Spark Streaming that make it easy to implement IoT use cases:

  • Sliding windows to identify faulty sensors, find trending items, and correlate data from disparate streams
  • Stateful operators to maintain user session information for personalization
  • MLlib for easy machine learning on streaming data.
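To make the sliding-window idea concrete, here is a minimal plain-Python sketch of windowed counts over micro-batches, used to spot a sensor that has gone silent. It is not Spark code: in Spark Streaming the comparable operators are `window` and `reduceByKeyAndWindow` (and `updateStateByKey` for stateful tracking). The function name, the window length of two batches, and the sensor IDs are assumptions for illustration.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, List


def sliding_window_counts(batches: Iterable[List[str]], window_length: int) -> Iterator[Counter]:
    """After each batch, yield item counts over the most recent `window_length` batches."""
    window: Deque[List[str]] = deque(maxlen=window_length)
    for batch in batches:
        window.append(batch)  # the oldest batch drops out automatically
        counts: Counter = Counter()
        for b in window:
            counts.update(b)
        yield counts


# Each batch lists the sensor IDs that reported during that interval.
batches = [["s1", "s2"], ["s1", "s2"], ["s1"], ["s1"]]
expected_sensors = {"s1", "s2"}

# A sensor missing from an entire window is flagged as possibly faulty.
silent_per_window = [
    sorted(expected_sensors - set(counts))
    for counts in sliding_window_counts(batches, window_length=2)
]
```

Recomputing the counts from scratch for every window keeps the sketch short; an incremental version would add the newest batch and subtract the one that left the window.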

We will share some pro tips for:

  • Performance tuning: checkpointing of stateful data, parallelization of data receivers, recommended data serialization formats and settings, and memory tuning
  • “Exactly Once Processing” semantics and how to achieve it.
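One common way to get exactly-once results on top of at-least-once delivery is to make the output write idempotent by tracking the highest offset already applied. The plain-Python sketch below illustrates that idea only; the `IdempotentSink` class is hypothetical, it assumes records within a partition arrive in offset order, and it is not necessarily the approach presented in the talk.

```python
from typing import Dict, List, Tuple


class IdempotentSink:
    """Hypothetical sink that applies each (partition, offset) at most once."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}
        self.last_offset: Dict[int, int] = {}  # partition -> highest applied offset

    def write_batch(self, partition: int, records: List[Tuple[int, str, float]]) -> None:
        """Apply (offset, key, value) records, skipping offsets already applied.

        Assumes records are ordered by offset within the partition.
        """
        applied = self.last_offset.get(partition, -1)
        for offset, key, value in records:
            if offset <= applied:
                continue  # replayed record: already reflected in totals
            self.totals[key] = self.totals.get(key, 0.0) + value
            applied = offset
        self.last_offset[partition] = applied


sink = IdempotentSink()
batch = [(0, "s1", 1.0), (1, "s1", 2.0)]
sink.write_batch(0, batch)
sink.write_batch(0, batch)  # simulated replay after a failure: no double counting
sink.write_batch(0, [(1, "s1", 2.0), (2, "s1", 4.0)])  # partial overlap with new data
```

In a real system the results and the offset would need to be committed together (for example in one transaction) so that a crash between the two updates cannot break the guarantee.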

Last, we will describe how to monitor your long-running streaming applications, and highlight some recent and upcoming improvements in monitoring.

Hari Shreedharan


Hari Shreedharan is a software engineer at Cloudera, an Apache Flume committer/PMC member, and a Spark contributor. He is the author of the O’Reilly Media book Using Flume.

Anand Iyer


Anand Iyer is a senior product manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for real-time streaming, Apache Spark, and tools for data ingestion into the Hadoop platform. Before joining Cloudera, Anand worked as an engineer at LinkedIn, where he applied machine-learning techniques to improve the relevance and personalization of LinkedIn’s Feed. Anand has extensive experience leveraging big data platforms to deliver products that delight customers. He holds a master’s in computer science from Stanford and a bachelor’s from the University of Arizona.

Comments on this page are now closed.


animesh banik
07/06/2015 4:07am EDT

What is the significance of IoT for the utility sector?