Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Extending Spark ML: Adding your own tools and algorithms

Holden Karau (IBM), Seth Hendrickson (Cloudera)
2:55pm–3:35pm Wednesday, September 27, 2017
Data Engineering & Architecture, Spark & beyond
Location: 1A 21/22
Level: Intermediate

Who is this presentation for?

  • Machine learning engineers and big data engineers

Prerequisite knowledge

  • A basic understanding of Spark, ideally including DataFrames

What you'll learn

  • Explore Spark ML pipeline internals
  • Learn how to implement your own Spark ML pipeline stages


Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you’ll be able to take advantage of useful meta-algorithms like parameter searching and pipeline persistence (with a bit more work, of course).

Even if you don’t have your own machine learning algorithms to implement, you’ll get an inside look at how the ML APIs are built, learn how to make even more awesome ML pipelines and customize Spark models for your needs, and develop a stronger background for future Spark ML projects.

The examples will be presented in Scala, but any nonstandard syntax will be explained.
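
To give a flavor of what a custom pipeline stage can look like, here is a minimal sketch of a Transformer that adds a column holding the length of a string column. It is not taken from the session materials. The class name StringLengthTransformer and the parameter names inputCol and outputCol are hypothetical choices for illustration, and the sketch assumes Spark 2.x with the spark-sql and spark-mllib artifacts on the classpath.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, length}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example stage: appends an integer column containing the
// character length of a string input column.
class StringLengthTransformer(override val uid: String) extends Transformer {

  // Zero-argument constructor that generates a unique stage ID.
  def this() = this(Identifiable.randomUID("strLen"))

  // Params make the stage configurable and visible to tools such as
  // parameter search. The names are illustrative assumptions.
  final val inputCol = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Validate the input schema and describe the output schema without
  // touching any data, so a pipeline can fail fast on a bad configuration.
  override def transformSchema(schema: StructType): StructType = {
    val field = schema($(inputCol))
    require(field.dataType == StringType,
      s"Input column ${$(inputCol)} must be StringType but was ${field.dataType}")
    require(!schema.fieldNames.contains($(outputCol)),
      s"Output column ${$(outputCol)} already exists")
    StructType(schema.fields :+ StructField($(outputCol), IntegerType, nullable = true))
  }

  // The actual transformation: add the length column.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.withColumn($(outputCol), length(col($(inputCol))))
  }

  // defaultCopy relies on the single-String (uid) constructor above.
  override def copy(extra: ParamMap): StringLengthTransformer = defaultCopy(extra)
}
```

A stage like this can be placed in a Pipeline alongside built-in stages, for example `new Pipeline().setStages(Array(new StringLengthTransformer().setInputCol("text").setOutputCol("textLength")))`. Persistence is not handled in this sketch; as the abstract notes, that takes a bit more work.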

Holden Karau


Holden Karau is a transgender Canadian Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Seth Hendrickson


Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.
