Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Model parallelism in Spark ML cross-validation

11:15–11:55 Thursday, 24 May 2018
Data science and machine learning
Location: Capital Suite 10/11
Level: Beginner
Average rating: 2.50 (2 ratings)

Who is this presentation for?

  • Data scientists and machine learning engineers

Prerequisite knowledge

  • Basic knowledge of Spark ML (useful but not required)

What you'll learn

  • Gain insight into best practices for scaling up ML model selection on your Spark cluster


Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate each of these models one by one to select the best performing model. When running this process with a large number of models, if the training and evaluation of a model does not fully utilize the available cluster resources, that waste is compounded for each model, leading to long runtimes.
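To make the growth concrete, here is a minimal sketch of how the number of model fits scales with the grid. The parameter names, values, and fold count are illustrative assumptions and do not come from the talk.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
param_grid = {
    "regParam": [0.01, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [10, 50],
}
num_folds = 3  # assumed number of cross-validation folds


def expand_grid(grid):
    """Return every hyperparameter combination as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def count_model_fits(grid, folds):
    """Each combination is trained and evaluated once per fold."""
    return len(expand_grid(grid)) * folds


total_fits = count_model_fits(param_grid, num_folds)
print(total_fits)  # 3 * 3 * 2 combinations * 3 folds = 54
```

Adding one more value to any parameter multiplies the count, so a serial run pays for any idle cluster capacity once per fit.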

Enabling model parallelism in Spark cross-validation, which allows more than one model to be trained and evaluated at the same time, makes better use of cluster resources. Nick Pentreath and Bryan Cutler explain how to enable this setting in Spark, discuss its effect on an example ML pipeline, and share best practices for using this feature. They also detail ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This work can be combined with model parallelism to further reduce the runtime of cross-validation for complex machine learning pipelines.
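The idea of evaluating several models at once can be sketched without Spark. The toy below uses a standard-library thread pool with a made-up scoring function. It is an illustration of the concept under those assumptions, not Spark's implementation or the speakers' code. The abstract does not name the setting; in Spark 2.3 and later, `CrossValidator` and `TrainValidationSplit` expose it as a `parallelism` parameter that defaults to 1 (serial).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_and_score(params, fold):
    """Stand-in for training on one fold and scoring on the held-out data.

    The metric is a made-up deterministic function, not a real model.
    """
    return 1.0 - abs(params["regParam"] - 0.1) - 0.01 * fold


def cross_validate(param_maps, num_folds, parallelism=1):
    """Score every (params, fold) pair, running up to `parallelism` at once."""
    tasks = [(i, fold) for i in range(len(param_maps)) for fold in range(num_folds)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map returns results in the same order as `tasks`.
        scores = list(pool.map(lambda t: fit_and_score(param_maps[t[0]], t[1]), tasks))
    # Average the fold scores for each parameter combination.
    averages = [
        mean(score for (i, _), score in zip(tasks, scores) if i == idx)
        for idx in range(len(param_maps))
    ]
    best_idx = max(range(len(param_maps)), key=lambda idx: averages[idx])
    return param_maps[best_idx], averages


# Hypothetical candidates; the values are illustrative only.
param_maps = [{"regParam": r} for r in (0.01, 0.1, 1.0)]
best_serial, avg_serial = cross_validate(param_maps, num_folds=3, parallelism=1)
best_parallel, avg_parallel = cross_validate(param_maps, num_folds=3, parallelism=4)
print(best_parallel)
```

The selected model is the same at any parallelism level; only the wall-clock time changes. On a real cluster, the useful level depends on how much of the cluster a single fit already occupies.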

Nick Pentreath


Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Bryan Cutler

Bryan Cutler is a software engineer at IBM’s Spark Technology Center, where he works on big data analytics. He is a contributor to Apache Spark in the areas of ML, SQL, Core, and Python and a committer for the Apache Arrow project. Bryan is interested in pushing the boundaries to build high-performance tools for analytics and machine learning.