Tuning a Spark ML model with cross-validation can be extremely computationally expensive. As the number of hyperparameter combinations increases, so does the number of models that must be trained and evaluated. By default, Spark evaluates these models one at a time to select the best performer. If training and evaluating a single model does not fully utilize the available cluster resources, that waste is compounded across every model, leading to long runtimes.
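To make the cost concrete: the number of models fitted is the number of hyperparameter combinations multiplied by the number of folds. The following is a minimal sketch in plain Python; the parameter names, values, and fold count are hypothetical and chosen only for illustration, not taken from the session.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
param_grid = {
    "regParam": [0.01, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [10, 50],
}
num_folds = 3  # assumed number of cross-validation folds

# Every combination of values is one candidate model configuration.
combinations = list(product(*param_grid.values()))

# Each candidate is trained and evaluated once per fold.
total_fits = len(combinations) * num_folds

print(f"{len(combinations)} combinations x {num_folds} folds = {total_fits} fits")
```

With this grid, 18 combinations and 3 folds give 54 separate fits, so any idle capacity during a single fit is paid for 54 times when the fits run one after another.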
Enabling model parallelism in Spark cross-validation, which allows more than one model to be trained and evaluated at the same time, makes better use of cluster resources. Nick Pentreath and Bryan Cutler explain how to enable this setting in Spark, discuss its effect on an example ML pipeline, and share best practices to keep in mind when using this feature. They also detail ongoing work to reduce the computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This work can be combined with model parallelism to further reduce the runtime of cross-validation for complex machine learning pipelines.
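The idea of model parallelism can be sketched without a cluster. The example below is a plain-Python illustration, not Spark's implementation: `fit_and_score` and `tune` are hypothetical stand-ins, the sleep simulates a training job that leaves resources idle, and the `parallelism` argument is named to mirror the parameter of the same name that Spark's `CrossValidator` has exposed since version 2.3 (default 1, meaning serial evaluation).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_and_score(params):
    """Hypothetical stand-in for training one model and evaluating it."""
    time.sleep(0.05)  # simulates a job that does not saturate the cluster
    # Toy metric for illustration: highest when reg equals 0.1.
    return -abs(params["reg"] - 0.1)


def tune(param_maps, parallelism=1):
    """Evaluate every parameter map, running up to `parallelism` at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, so scores line up with param_maps.
        scores = list(pool.map(fit_and_score, param_maps))
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return param_maps[best_index], scores


param_maps = [{"reg": r} for r in (0.001, 0.01, 0.1, 1.0)]

# parallelism=1 evaluates candidates one by one; parallelism=4 overlaps them.
best_serial, _ = tune(param_maps, parallelism=1)
best_parallel, _ = tune(param_maps, parallelism=4)
print(best_serial, best_parallel)
```

Both calls select the same parameter map; only the wall-clock time differs. In PySpark the corresponding setting would be passed as `CrossValidator(..., parallelism=4)`. A suitable value depends on how much spare capacity a single fit leaves on the cluster, which is an assumption to verify for each workload.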
Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.
Bryan Cutler is a software engineer at IBM’s Spark Technology Center, where he works on big data analytics. He is a contributor to Apache Spark in the areas of ML, SQL, Core, and Python and a committer for the Apache Arrow project. Bryan is interested in pushing the boundaries to build high-performance tools for analytics and machine learning.