Presented By O’Reilly and Intel AI
Put AI to Work
April 29-30, 2018: Training
April 30-May 2, 2018: Tutorials & Conference
New York, NY

Distributed DNN training: Infrastructure, challenges, and lessons learned

Kaarthik Sivashanmugam (Microsoft), Wee Hyong Tok (Microsoft)
1:45pm–2:25pm Tuesday, May 1, 2018
Implementing AI
Location: Grand Ballroom East

Who is this presentation for?

  • Deep learning engineers and AI platform developers

Prerequisite knowledge

  • Familiarity with deep learning concepts and distributed computing infrastructure

What you'll learn

  • Understand the infrastructure requirements for scalable and efficient distributed DNN training
  • Explore common challenges and how to address them

Description

Deep learning is revolutionizing a wide range of applications across industries and in organizations of all sizes. Scalable DNN training is critical to the success of large-scale deep learning, and the methodologies, tools, and infrastructure in this space are evolving rapidly.

Drawing on their experience building a multitenant, distributed DNN training infrastructure that uses familiar OSS components to run Docker container-based deep learning workloads from hundreds of AI applications on clusters with thousands of GPUs, Kaarthik Sivashanmugam and Wee Hyong Tok share recommendations for addressing the common challenges in enabling scalable and efficient distributed DNN training, along with lessons learned from building and operating a large-scale training infrastructure. Kaarthik and Wee Hyong introduce the challenges in distributed DNN training and provide an overview of the components that enable distributed training on bare-metal infrastructure, virtual machines, and containers. They also offer practical tips for running deep learning workloads on Kubernetes clusters on Azure and explain how you can leverage deep learning toolkits (e.g., CNTK, TensorFlow) on these clusters to do distributed training.

Topics include:

  • Components needed to enable distributed training
  • Considerations for reusing the compute capacity in big data clusters for DNN training
  • Patterns and practices for distributed deep learning to efficiently scale to multiple nodes
  • Tuning for high-performing training
  • Building a plan to easily migrate from bare metal infrastructure to VMs to containers in the cloud
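As a concrete illustration of the kind of setup the session covers, multi-worker TensorFlow training on a cluster is typically coordinated through the TF_CONFIG environment variable, which tells each process the cluster layout and its own role. The sketch below (not from the talk; the hostnames and ports are illustrative placeholders) shows how such a configuration might be assembled for each container or VM in the cluster:

```python
import json
import os

def make_tf_config(workers, ps, task_type, task_index):
    """Build the TF_CONFIG JSON string that TensorFlow's distributed
    runtime reads to learn the cluster layout and this process's role.

    workers / ps: lists of "host:port" strings for worker and
    parameter-server tasks (placeholders here, not real endpoints).
    task_type / task_index: the role of *this* process in the cluster.
    """
    return json.dumps({
        "cluster": {"worker": workers, "ps": ps},
        "task": {"type": task_type, "index": task_index},
    })

# Example: two workers and one parameter server; this process is worker 0.
# On Kubernetes, each pod would get its own task index, typically injected
# by the controller that launches the training job.
workers = ["worker-0.default.svc:2222", "worker-1.default.svc:2222"]
ps = ["ps-0.default.svc:2222"]
os.environ["TF_CONFIG"] = make_tf_config(workers, ps, "worker", 0)
```

In a containerized deployment, each replica runs the same image and differs only in the task type and index baked into its TF_CONFIG, which is what makes the parameter-server and worker roles easy to schedule as ordinary pods.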

Kaarthik Sivashanmugam

Microsoft

Kaarthik Sivashanmugam is a principal software engineering manager in the AI Platform group at Microsoft, where he leads a team of software engineers and applied scientists implementing large-scale training workloads on the Azure Machine Learning service and enhancing the service to make it the best cloud platform for data scientists and ML engineers. Previously, Kaarthik worked on a near-real-time data processing platform and GPU infrastructure for deep learning.


Wee Hyong Tok

Microsoft

Wee Hyong Tok is a principal data science manager with the AI CTO Office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI advisor to customers. Wee Hyong coauthored several books on artificial intelligence, including Predictive Analytics Using Azure Machine Learning and Doing Data Science with SQL Server. Wee Hyong holds a PhD in computer science from the National University of Singapore.