Sep 9–12, 2019

Running large-scale machine learning experiments in the cloud

Shashank Prasanna (Amazon Web Services)
1:45pm–2:25pm Thursday, September 12, 2019
Location: 230 C

Who is this presentation for?

  • Data scientists, machine learning researchers, software engineers, software developers, directors of engineering, CTOs, VPs of engineering, and DevOps




Machine learning involves a lot of experimentation. Data scientists and researchers spend days, weeks, or months on algorithm searches, model architecture searches, and hyperparameter searches, as well as exploring different validation schemes, model averaging, and other techniques. This is time consuming, but it’s necessary to arrive at the best-performing model for the problem at hand. Even though virtually unlimited compute and storage capacity is now accessible to anyone in the cloud, most machine learning workflows still involve interactively running experiments on a single GPU instance because of the complexity of setting up, managing, and scheduling experiments at scale.

With container-based technologies such as Kubernetes, Amazon ECS, and Amazon SageMaker, data scientists and researchers can focus on designing and running experiments and let these services handle the infrastructure setup, scheduling, and orchestration of those experiments. Shashank Prasanna breaks down how you can run large-scale machine learning experiments on CPU and GPU clusters with these services and how the services compare. You’ll also learn how to manage trade-offs between time to solution and cost by scaling resources up or back.
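To make the time-versus-cost trade-off concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the talk. It assumes a sweep of independent trials of equal length, one trial per instance at a time, and every instance billed until the whole sweep finishes. The function name `sweep_estimate`, the trial count, the trial duration, and the hourly price are all hypothetical values chosen for illustration.

```python
import math


def sweep_estimate(num_trials, hours_per_trial, num_instances, price_per_hour):
    """Estimate wall-clock hours and total cost for a sweep of independent trials.

    Assumptions (illustrative only): every trial takes the same time, each
    instance runs one trial at a time, and all instances stay up (and are
    billed) until the last wave of trials finishes.
    """
    # Trials run in waves of at most `num_instances` at a time.
    waves = math.ceil(num_trials / num_instances)
    wall_clock_hours = waves * hours_per_trial
    # Idle instances in a partially filled last wave are still billed.
    cost = num_instances * wall_clock_hours * price_per_hour
    return wall_clock_hours, cost


if __name__ == "__main__":
    # Hypothetical sweep: 64 trials, 2 hours each, 3.00 per instance-hour.
    for instances in (1, 8, 16, 48, 64):
        hours, cost = sweep_estimate(64, 2.0, instances, 3.0)
        print(f"{instances:>3} instances: {hours:6.1f} h wall clock, cost {cost:7.2f}")
```

Under these assumptions, cluster sizes that divide the trial count evenly shorten the wall-clock time without changing the total bill, while a size that leaves instances idle in the last wave (48 here) costs more. Real clusters add startup time, data transfer, and pricing differences that this sketch ignores.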

Prerequisite knowledge

  • Familiarity with machine learning workflows, one or more machine learning frameworks or toolkits, and Docker containers

What you'll learn

  • Learn how to set up machine learning containers for CPU and GPU training, how to set up and use orchestrators such as Kubernetes and Amazon ECS for large-scale machine learning training experiments, and how to introduce fault-tolerance into your machine learning experimentation pipeline
  • Evaluate different cloud offerings and choose the approach that is compatible with existing infrastructure and time to solution

Shashank Prasanna

Amazon Web Services

Shashank Prasanna is an AI and machine learning evangelist at Amazon Web Services, where he focuses on helping engineers, developers, and data scientists solve challenging problems with machine learning. Previously, he worked at NVIDIA, MathWorks (makers of MATLAB), and Oracle in product marketing and software development roles focused on machine learning products. Shashank holds an MS in electrical engineering from Arizona State University.
