Deep learning is transforming applications across industries and in organizations of all sizes, and scalable DNN training is critical to the success of large-scale deep learning. The methodologies, tools, and infrastructure in this space are evolving rapidly. Kaarthik Sivashanmugam and Wee Hyong Tok draw on their experience building a multitenant, distributed DNN training infrastructure that uses familiar OSS components to execute Docker container-based deep learning workloads from hundreds of AI applications on clusters with thousands of GPUs. They share recommendations for addressing the common challenges in enabling scalable and efficient distributed DNN training, along with lessons learned in building and operating a large-scale training infrastructure. Kaarthik and Wee Hyong introduce the challenges in distributed DNN training and provide an overview of the components that can enable distributed training on bare metal infrastructure, virtual machines, and containers. In addition, they outline practical tips for running deep learning workloads on Kubernetes clusters on Azure and explain how you can leverage deep learning toolkits (e.g., CNTK, TensorFlow) on these clusters to do distributed training.
Kaarthik is a Principal Software Engineering Manager in the AI Platform group at Microsoft. In his current role, he leads a team of software engineers and applied scientists in implementing large-scale training workloads on the Azure Machine Learning service and enhancing the service to make it the best cloud platform for data scientists and ML engineers. Prior to this work, Kaarthik was involved in the development of a near-real-time data processing platform and GPU infrastructure for deep learning.
Wee Hyong Tok is a principal data science manager with the AI CTO Office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI advisor to customers. Wee Hyong has coauthored several books on artificial intelligence, including Predictive Analytics Using Azure Machine Learning and Doing Data Science with SQL Server. He holds a PhD in computer science from the National University of Singapore.
©2018, O'Reilly Media, Inc.