Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Spark Structured Streaming for machine learning

Holden Karau (IBM), Seth Hendrickson (Cloudera)
1:50pm–2:30pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D Level: Intermediate
Secondary topics:  Streaming
Average rating: 4.00 (8 ratings)

Who is this presentation for?

  • Engineers

Prerequisite knowledge

  • Knowledge of Apache Spark, including DataFrames and datasets

What you'll learn

  • Better understand both Spark ML and Structured Streaming

Description

Structured Streaming is new in Apache Spark 2.0, and work is under way to integrate Spark's machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson look at the current state of Structured Streaming and machine learning before walking you through creating your own streaming model. Holden and Seth will also cover how to use machine-learning algorithms built on Structured Streaming (if they have been merged by the time of the talk). By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.

Holden Karau

IBM

Holden Karau is a transgender Canadian, an Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Seth Hendrickson

Cloudera

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and one-pass elastic-net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine-learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Comments

Melanio Reyes | SENIOR SOFTWARE ENGINEER
03/24/2017 4:39am PDT

Hey where can I find the slides to this presentation? Thanks!