Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Spark Structured Streaming for machine learning

Holden Karau (Independent), Seth Hendrickson (Cloudera)
2:05pm–2:45pm Thursday, 09/29/2016
Spark & beyond
Location: Hall 1B
Level: Intermediate

Prerequisite knowledge

  • A basic knowledge of Spark (equivalent to Learning Spark)

What you'll learn

  • Understand the basics of Structured Streaming and streaming machine learning

Description

    Streaming machine learning is being integrated into Spark 2.1, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. Holden and Seth will also cover how to use structured machine-learning algorithms (if they have been merged by the time of the talk). By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.

    Holden Karau


    Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

    Seth Hendrickson


    Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

    Comments on this page are now closed.


    Holden Karau
    09/30/2016 5:58am EDT

    @RamJois – yup I’ve got the slides up at now :)

    Also sorry everyone for the audio issues at the start of the talk.

    09/29/2016 10:10am EDT

    Can you share the slides?