Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

Extending Spark ML: Adding custom pipeline stages to Spark

Holden Karau (Independent)
4:15pm–4:55pm Wednesday, December 6, 2017
Average rating: 4.50 (6 ratings)

Who is this presentation for?

  • Data engineers or scientists interested in machine learning and Spark

Prerequisite knowledge

  • Basic knowledge of Spark

What you'll learn

  • Gain a better understanding of Spark ML
  • Learn how to add your own ML pipeline stages


Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau introduces Spark’s ML pipelines and explains how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, such as parameter searching and pipeline persistence (with a bit more work, of course).

Even if you don’t have your own machine learning algorithms to implement, you’ll gain an inside look at how the ML APIs are built, which will help you build better ML pipelines, customize Spark models for your needs, and develop a stronger background for future Spark ML projects.

The examples in this talk will be presented in Scala, but any nonstandard syntax will be explained.
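
To give a rough idea of what a custom pipeline stage can look like, here is a minimal sketch in Scala. It is not taken from the talk: the class name StringLength, the column names, and the sample data are hypothetical, and it assumes a Spark version in which UnaryTransformer is available with the signatures used below. It defines a transformer that maps a string column to its length and runs it inside a Pipeline.

```scala
import org.apache.spark.ml.{Pipeline, UnaryTransformer}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Hypothetical custom stage: maps a string column to the length of each string.
// UnaryTransformer supplies the inputCol/outputCol params, schema handling, and copy().
class StringLength(override val uid: String)
    extends UnaryTransformer[String, Int, StringLength] {

  // The no-argument constructor generates a unique id for this stage.
  def this() = this(Identifiable.randomUID("stringLength"))

  // The function applied to each input value (assumes non-null strings).
  override protected def createTransformFunc: String => Int = _.length

  // The type of the output column.
  override protected def outputDataType: DataType = IntegerType

  // Fail early if the input column is not a string column.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")
}

object StringLengthExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("custom-stage-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq("spark", "ml pipelines").toDF("text")

    // The custom stage is configured like any built-in stage.
    val stage = new StringLength().setInputCol("text").setOutputCol("length")
    val model = new Pipeline().setStages(Array(stage)).fit(df)

    model.transform(df).show()
    spark.stop()
  }
}
```

Because the stage extends a standard Spark ML base class, it can sit alongside built-in stages in a Pipeline. This sketch does not cover pipeline persistence, which requires additional work as the abstract notes.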

Holden Karau


Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.