Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Boosting Spark MLlib performance with rich optimization algorithms

Seth Hendrickson (Cloudera), DB Tsai (Netflix)
5:25pm–6:05pm Wednesday, September 27, 2017
Machine Learning & Data Science, Spark & beyond
Location: 1A 08/10 Level: Advanced
Secondary topics:  Media
Average rating: 5.00 (1 rating)

Who is this presentation for?

  • Software engineers, machine learning engineers, Spark developers, and data scientists

Prerequisite knowledge

  • Familiarity with Spark MLlib and machine learning

What you'll learn

  • Learn how to leverage Spark's built-in optimization API to express a wider variety of machine learning problems and improve the training time and scalability of your models
  • Learn how to write optimization algorithms for advanced applications and understand when it is appropriate to do so


Many methods in machine learning are based on finding parameters that minimize some objective function. The performance and scalability of these models are highly dependent on how the parameters are optimized. Spark MLlib gives users flexibility in their ML models by exposing a number of hyperparameters that can be tuned during fitting, but it has not previously allowed custom optimizers that deviate from traditional synchronous, batch-gradient solvers.
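As a minimal, self-contained illustration of the kind of objective minimization described above (plain NumPy/SciPy, not Spark code), here is a logistic-regression loss fitted with L-BFGS, one of the solvers the talk covers:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.uniform(size=200)).astype(float)

def logistic_loss(w):
    """Negative log-likelihood: sum of log(1 + exp(-s)), s = z for y=1, -z for y=0."""
    z = X @ w
    s = np.where(y == 1, z, -z)
    return np.sum(np.logaddexp(0.0, -s))

def logistic_grad(w):
    """Gradient of the loss: X^T (sigmoid(Xw) - y)."""
    p = 1 / (1 + np.exp(-(X @ w)))
    return X.T @ (p - y)

# L-BFGS finds the parameters minimizing the objective
result = minimize(logistic_loss, np.zeros(3), jac=logistic_grad, method="L-BFGS-B")
w_hat = result.x
```

The choice of solver here (L-BFGS vs. plain gradient descent, for example) is exactly the kind of decision that drives the training-time differences discussed in the session.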

Seth Hendrickson and DB Tsai are leading development efforts within Spark MLlib to provide a new optimization API, integrated with Spark DataFrames, that enables users to choose between built-in optimization algorithms such as L-BFGS and OWL-QN with no added code, wrap preexisting implementations, or implement their own optimizers with minimal effort. Seth and DB explain how using alternative optimization algorithms for parameter fitting can provide more expressiveness in ML models and greatly decrease training times; they provide concrete examples (with code) of several of these algorithms and share performance results to demonstrate these concepts. Along the way, they also explore how this pluggable interface can be used to scale out Spark MLlib’s GLM library to several million features, as well as how this added expressiveness enabled a real-world application for recommendation systems and predicting clicks on ads.
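The pluggable-optimizer idea can be sketched in miniature as follows. This is a hypothetical interface written in Python for illustration only, not the actual Spark MLlib API; the names `Optimizer`, `GradientDescent`, and `fit_linear` are invented for this sketch:

```python
import numpy as np

class Optimizer:
    """Hypothetical pluggable interface: given a loss and its gradient,
    return fitted parameters. Swappable without touching model code."""
    def minimize(self, loss, grad, w0):
        raise NotImplementedError

class GradientDescent(Optimizer):
    """One concrete optimizer; an L-BFGS or OWL-QN wrapper would plug in
    through the same interface."""
    def __init__(self, lr=0.1, iters=100):
        self.lr, self.iters = lr, iters

    def minimize(self, loss, grad, w0):
        w = w0.copy()
        for _ in range(self.iters):
            w -= self.lr * grad(w)
        return w

def fit_linear(X, y, optimizer: Optimizer):
    """Model code depends only on the Optimizer interface, so users can
    choose built-in solvers, wrap existing ones, or write their own."""
    loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
    grad = lambda w: X.T @ (X @ w - y)
    return optimizer.minimize(loss, grad, np.zeros(X.shape[1]))
```

The design point is separation of concerns: the model defines the objective and gradient, while the optimizer decides how to traverse the parameter space.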


Seth Hendrickson


Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.


DB Tsai


DB Tsai is a senior research engineer working on personalized recommendation algorithms at Netflix. He’s also a committer and member of the Apache Spark Project Management Committee (PMC). DB has implemented several algorithms in Apache Spark, including linear regression and binary/multinomial logistic regression with elastic net (L1/L2) regularization using the L-BFGS and OWL-QN optimizers. Previously, he was a lead machine learning engineer at Alpine Data Labs, where he led a team developing innovative large-scale distributed learning algorithms and contributed back to the open source Apache Spark project. DB was a PhD candidate in applied physics at Stanford University and holds a master’s degree in electrical engineering from Stanford University.