Presented By O’Reilly and Intel AI
Put AI to Work
April 29-30, 2018: Training
April 30-May 2, 2018: Tutorials & Conference
New York, NY

Distributed DNN training: Infrastructure, challenges, and lessons learned

Kaarthik Sivashanmugam (Microsoft), Wee Hyong Tok (Microsoft)
1:45pm–2:25pm Tuesday, May 1, 2018
Implementing AI
Location: Grand Ballroom East

Who is this presentation for?

  • Deep learning engineers and AI platform developers

Prerequisite knowledge

  • Familiarity with deep learning concepts and distributed computing infrastructure

What you'll learn

  • Understand the infrastructure requirements for scalable and efficient distributed DNN training
  • Explore common challenges and how to address them


Deep learning is revolutionizing a wide range of applications across industries and in organizations of all sizes. Scalable DNN training is critical to the success of large-scale deep learning, and the methodologies, tools, and infrastructure in this space are evolving rapidly.

Kaarthik Sivashanmugam and Wee Hyong Tok draw on their experience building a multitenant, distributed DNN training infrastructure that uses familiar OSS components to run Docker container-based deep learning workloads from hundreds of AI applications on clusters with thousands of GPUs. They share recommendations for addressing the common challenges in enabling scalable and efficient distributed DNN training, along with lessons learned in building and operating a large-scale training infrastructure.

Kaarthik and Wee Hyong introduce the challenges in distributed DNN training and provide an overview of the components that can enable distributed training on bare metal infrastructure, virtual machines, and containers. In addition, they outline practical tips for running deep learning workloads on Kubernetes clusters on Azure and explain how you can use deep learning toolkits (e.g., CNTK, TensorFlow) on these clusters for distributed training.

Topics include:

  • Components needed to enable distributed training
  • Considerations for reusing the compute capacity in big data clusters for DNN training
  • Patterns and practices for distributed deep learning to efficiently scale to multiple nodes
  • Tuning for high-performing training
  • Building a plan to easily migrate from bare metal infrastructure to VMs to containers in the cloud

Kaarthik Sivashanmugam


Kaarthik Sivashanmugam is a principal software engineering manager in the AI Platform group at Microsoft. In his current role, he leads a team of software engineers and applied scientists who implement large-scale training workloads on the Azure Machine Learning service and enhance the service to make it the best cloud platform for data scientists and ML engineers. Prior to this work, Kaarthik was involved in the development of a near-real-time data processing platform and GPU infrastructure for deep learning.


Wee Hyong Tok


Wee Hyong Tok is a principal data science manager with the AI CTO Office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI advisor to customers. Wee Hyong has coauthored several books on artificial intelligence, including Predictive Analytics Using Azure Machine Learning and Doing Data Science with SQL Server. He holds a PhD in computer science from the National University of Singapore.