Presented By O’Reilly and Cloudera

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Model parallelism in Spark ML cross-validation

Nick Pentreath (IBM), Bryan Cutler (IBM)

11:15–11:55 Thursday, 24 May 2018

Data science and machine learning
Location: Capital Suite 10/11
Level: Beginner

Who is this presentation for?

Data scientists and machine learning engineers

Prerequisite knowledge

Basic knowledge of Spark ML (useful but not required)

What you'll learn

Gain insight into best practices for scaling up ML model selection on your Spark cluster

Description

Tuning a Spark ML model with cross-validation can be extremely computationally expensive. As the number of hyperparameter combinations increases, so does the number of models being evaluated. By default, Spark evaluates these models one by one to select the best-performing model. When this process runs over a large number of models, and the training and evaluation of a single model does not fully utilize the available cluster resources, that waste is compounded for each model, leading to long runtimes.
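To make the growth concrete, here is a minimal sketch in plain Python (no Spark required) that counts how many fit-and-evaluate runs a grid search with k-fold cross-validation needs. The grid values, the fold count, and the two-minutes-per-model figure are all hypothetical, chosen only for illustration, and the count leaves out the final refit of the best model on the full dataset.

```python
def count_models(param_grid, num_folds):
    """Number of fit-and-evaluate runs for k-fold cross-validation over a grid."""
    combinations = 1
    for values in param_grid.values():
        combinations *= len(values)
    # Every hyperparameter combination is trained and evaluated once per fold.
    return combinations * num_folds


def serial_runtime_minutes(num_models, minutes_per_model):
    """Total runtime when models are evaluated one by one."""
    return num_models * minutes_per_model


# Hypothetical grid: 3 x 4 x 2 = 24 hyperparameter combinations.
param_grid = {
    "regParam": [0.01, 0.1, 1.0],
    "maxIter": [10, 20, 50, 100],
    "elasticNetParam": [0.0, 0.5],
}
num_folds = 3

total_models = count_models(param_grid, num_folds)          # 24 * 3 = 72
total_minutes = serial_runtime_minutes(total_models, 2)     # 72 * 2 = 144
print(total_models, total_minutes)
```

Adding one more value to any single hyperparameter list adds a full slice of runs, which is why serial evaluation becomes slow quickly when each run leaves part of the cluster idle.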

Enabling model parallelism in Spark cross-validation, which allows more than one model to be trained and evaluated at the same time, makes better use of cluster resources. Nick Pentreath and Bryan Cutler explain how to enable this setting in Spark, discuss its effect on an example ML pipeline, and share best practices to keep in mind when using this feature. They also detail ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This work can be combined with model parallelism to further reduce the runtime of cross-validation for complex machine learning pipelines.
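In Spark 2.3 and later, the setting in question is the `parallelism` parameter on `CrossValidator` (for example, `CrossValidator(..., parallelism=4)` in PySpark). The sketch below does not use Spark. It is a small stand-in, using only the Python standard library, that shows the idea: the (hyperparameters, fold) runs are independent, so a bounded pool of workers can evaluate several at once and still select the same best model. The function `fit_and_score`, its fake scoring formula, and the grid values are hypothetical placeholders for real training and evaluation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_and_score(task):
    """Stand-in for training one model on one fold and returning its metric."""
    (reg, depth), fold = task
    # Deterministic fake score (higher is better); a real run would fit a model here.
    return -(reg - 0.1) ** 2 - 0.01 * (depth - 5) ** 2 + 0.001 * fold


def cross_validate(grid, num_folds, parallelism=1):
    """Evaluate every (params, fold) pair with at most `parallelism` running at once."""
    tasks = [(params, fold) for params in grid for fold in range(num_folds)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map returns results in the same order as `tasks`.
        scores = list(pool.map(fit_and_score, tasks))

    # Average the metric over folds for each hyperparameter combination.
    average = {}
    for (params, _fold), score in zip(tasks, scores):
        average[params] = average.get(params, 0.0) + score / num_folds
    return max(average, key=average.get)


# Hypothetical grid of (regularization, depth) pairs: 3 x 3 = 9 combinations.
grid = list(product([0.01, 0.1, 1.0], [3, 5, 7]))

best_serial = cross_validate(grid, num_folds=3, parallelism=1)
best_parallel = cross_validate(grid, num_folds=3, parallelism=4)
print(best_serial, best_parallel)
```

With `parallelism=1` the runs execute one by one, matching the default behavior described above. Raising the value changes only how many runs are in flight at once, not which model is selected. How high to set it on a real cluster depends on how much of the cluster a single model's training actually uses.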

Nick Pentreath

IBM

Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Bryan Cutler

IBM

Bryan Cutler is a software engineer at IBM’s Spark Technology Center, where he works on big data analytics. He is a contributor to Apache Spark in the areas of ML, SQL, Core, and Python and a committer for the Apache Arrow project. Bryan is interested in pushing the boundaries to build high-performance tools for analytics and machine learning.
