Many methods in machine learning are based on finding parameters that minimize some objective function. The performance and scalability of these models is highly dependent on how the parameters are optimized. Spark MLlib provides users flexibility in their ML models by exposing a number of hyperparameters that can be tuned during fitting but has not previously allowed for custom optimizers that deviate from traditional synchronous, batch-gradient solutions.
Seth Hendrickson and DB Tsai are leading development efforts within Spark MLlib to provide a new optimization API, which integrates with Spark DataFrames, that enables users to choose between built-in optimization algorithms such as L-BFGS and OWL-QN with no added code, wrap preexisting implementations, or implement their own optimizers with minimal effort. Seth and DB explain how using alternative optimization algorithms for parameter fitting can provide more expressiveness in ML models and greatly decrease training times; they provide concrete examples (with code) of several of these algorithms and share performance results to demonstrate these concepts. Along the way, they also explore how this pluggable interface can be used to scale out Spark MLlib’s GLM library to several million features, as well how this added expressiveness enabled a real-world application for recommendation systems and predicting clicks on ads.
Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.
DB Tsai is a senior research engineer working on personalized recommendation algorithms at Netflix. He’s also a member of and committer for the Apache Spark Project Management Committee (PMC). DB has implemented several algorithms, including linear Rrgression and binary/multinomial logistic regression with elastic net (L1/L2) regularization using LBFGS/OWL-QN optimizers in Apache Spark. Previously, he was a lead machine learning engineer at Alpine Data Labs, where he led a team to develop innovative large-scale distributed learning algorithms and contributed back to the open source Apache Spark project. DB was a PhD candidate in applied physics at Stanford University. He holds a master’s degree in electrical engineering from Stanford University.
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org