Sep 9–12, 2019

Running large-scale machine learning experiments in the cloud

Shashank Prasanna (Amazon Web Services)
1:45pm–2:25pm Thursday, September 12, 2019
Location: 230 C
Average rating: 3.80 (5 ratings)

Who is this presentation for?

  • Data scientists, machine learning researchers, software engineers, software developers, directors of engineering, CTOs, VPs of engineering, and DevOps

Level

Intermediate

Description

Machine learning involves a lot of experimentation; there’s no question about it. Data scientists and researchers spend days, weeks, or months performing steps such as algorithm searches, model architecture searches, and hyperparameter searches, as well as exploring different validation schemes, model averaging, and more. This can be time-consuming, but it’s necessary to arrive at the best-performing model for the problem at hand. Even though virtually infinite compute and storage capacity is now accessible to anyone in the cloud, most machine learning workflows still involve interactively running experiments on a single GPU instance because of the complexity of setting up, managing, and scheduling experiments at scale.

With container-based technologies such as Kubernetes, Amazon ECS, and Amazon SageMaker, data scientists and researchers can focus on designing and running experiments and let these services manage infrastructure setup, scheduling, and orchestrating the machine learning experiments. Shashank Prasanna breaks down how you can easily run large-scale machine learning experiments on CPU and GPU clusters with these services and how they compare. You’ll also learn how to manage trade-offs between time to solution and cost by scaling up or scaling back resources.
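To make the idea concrete, here is a minimal sketch (not taken from the talk) of how a hyperparameter search could be fanned out across a cluster of single-GPU instances with the SageMaker Python SDK instead of being run serially on one machine. The container image URI, S3 paths, IAM role ARN, and metric regex below are placeholder assumptions.

    # Minimal sketch: launch a parallel hyperparameter search on Amazon SageMaker.
    # All resource names (image, role, buckets) are placeholders.
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

    session = sagemaker.Session()
    role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder role

    # One training trial = one container run on a single-GPU instance
    estimator = Estimator(
        image_uri="111122223333.dkr.ecr.us-west-2.amazonaws.com/my-training:latest",  # placeholder image
        role=role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        output_path="s3://my-bucket/output",  # placeholder bucket
        sagemaker_session=session,
    )

    # Search space for the experiment
    hyperparameter_ranges = {
        "learning_rate": ContinuousParameter(1e-5, 1e-1, scaling_type="Logarithmic"),
        "batch_size": IntegerParameter(32, 512),
    }

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:accuracy",
        metric_definitions=[{"Name": "validation:accuracy",
                             "Regex": "val_acc=([0-9\\.]+)"}],  # assumes training logs print val_acc=<value>
        hyperparameter_ranges=hyperparameter_ranges,
        max_jobs=50,           # total trials in the experiment
        max_parallel_jobs=10,  # trials running concurrently
    )

    # SageMaker provisions the instances, schedules the trials, and tears the cluster down
    tuner.fit({"train": "s3://my-bucket/train"})  # placeholder training data channel

The time-to-solution versus cost trade-off mentioned above maps directly onto max_parallel_jobs and the instance type: more parallel trials or larger instances finish the experiment sooner at a higher hourly spend.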

Prerequisite knowledge

  • Familiarity with machine learning workflows, one or more machine learning frameworks or toolkits, and Docker containers

What you'll learn

  • Learn how to set up machine learning containers for CPU and GPU training, how to set up and use orchestrators such as Kubernetes and Amazon ECS for large-scale machine learning training experiments, and how to introduce fault tolerance into your machine learning experimentation pipeline (a minimal sketch follows this list)
  • Evaluate different cloud offerings and choose the approach that's compatible with your existing infrastructure and time-to-solution requirements
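
As a companion sketch (again, not from the talk), the snippet below uses the official Kubernetes Python client to submit one training trial as a Job with a GPU request and automatic retries, which is one simple way to get fault tolerance from the orchestrator. The image name, command, trial name, and namespace are placeholders.

    # Minimal sketch: submit a single GPU training trial as a Kubernetes Job.
    # Image, command, and namespace are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

    container = client.V1Container(
        name="trainer",
        image="my-registry/my-training:latest",           # placeholder image
        command=["python", "train.py", "--lr", "0.001"],  # placeholder trial arguments
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}                # request one GPU
        ),
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="trial-lr-0-001"),  # placeholder trial name
        spec=client.V1JobSpec(
            backoff_limit=3,  # retry a failed trial up to 3 times (fault tolerance)
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="OnFailure",
                    containers=[container],
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

Submitting each trial as its own Job lets the scheduler pack work onto whatever CPU or GPU nodes are available and restart trials that fail, rather than tying the whole experiment to one interactive session.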

Shashank Prasanna

Amazon Web Services

Shashank Prasanna is a senior AI and machine learning evangelist at Amazon Web Services, where he focuses on helping engineers, developers, and data scientists solve challenging problems with machine learning. Previously, he worked at NVIDIA, MathWorks (makers of MATLAB), and Oracle in product marketing and software development roles focused on machine learning products. Shashank holds an MS in electrical engineering from Arizona State University.

