After years of active research and development in deep learning, organizations have begun to explore ways to train and serve deep learning models on a cluster in a distributed fashion. Many build a dedicated GPU HPC cluster, which works well in a research or development setting but requires data to be moved continually between clusters. This adds overhead in managing both the data used to train deep learning models and the models themselves as they move between research/development and production.
Dong Meng outlines the topics that need to be addressed to successfully utilize distributed deep learning, such as consistency, fault tolerance, communication, resource management, and programming libraries. He then offers an overview of a converged data platform that serves as the data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer that manages the containers used to train and deploy deep learning models on GPU clusters. Along the way, Dong demonstrates a simple distributed deep learning training program and explains how to leverage pub/sub capability to build global real-time deep learning applications on NVIDIA GPUs.
For consistency, most DL libraries introduce a parameter server and worker architecture to enable synchronization. For fault tolerance, a checkpoint-and-reload strategy is used. For communication, designing the volume topology in the distributed filesystem lets you move GPU computing closer to the data, which addresses possible communication congestion by bringing together your deep learning model, your data, and your applications. For resource management, Kubernetes orchestrates the containers that train and deploy deep learning models with GPUs.
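To make the parameter server and checkpoint-reload ideas concrete, here is a minimal pure-Python sketch. It is not the API of TensorFlow, Apache MXNet, or any other library, and it is not the program shown in the session. The names `ParameterServer`, `worker_gradient`, and `train` are hypothetical, the workers are simulated in a single process, and a local temporary file stands in for a checkpoint path on the distributed filesystem.

```python
import json
import os
import tempfile


class ParameterServer:
    """Holds the shared weights and applies averaged worker gradients."""

    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr
        self.step = 0

    def pull(self):
        # Workers fetch a copy of the current weights before each step.
        return list(self.weights)

    def push(self, grads_from_workers):
        # Synchronous update: wait for every worker, then average.
        n = len(grads_from_workers)
        for i in range(len(self.weights)):
            avg = sum(g[i] for g in grads_from_workers) / n
            self.weights[i] -= self.lr * avg
        self.step += 1

    def save(self, path):
        # Write to a temporary name, then rename, so a crash during the
        # write never leaves a half-written checkpoint at `path`.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": self.step, "weights": self.weights}, f)
        os.replace(tmp, path)

    @classmethod
    def restore(cls, path, lr=0.1):
        with open(path) as f:
            state = json.load(f)
        ps = cls(state["weights"], lr)
        ps.step = state["step"]
        return ps


def worker_gradient(weights, shard):
    """Gradient of mean squared error for y = w*x + b on one data shard."""
    w, b = weights
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return [gw, gb]


def train(ps, shards, total_steps, ckpt_path, ckpt_every=10):
    while ps.step < total_steps:
        weights = ps.pull()
        grads = [worker_gradient(weights, s) for s in shards]
        ps.push(grads)
        if ps.step % ckpt_every == 0:
            ps.save(ckpt_path)
    return ps


# Toy data for y = 2x + 1, split into two shards (one per simulated worker).
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]

with tempfile.TemporaryDirectory() as ckpt_dir:
    ckpt = os.path.join(ckpt_dir, "model.json")
    # The first run "crashes" at step 57; the last checkpoint is from step 50.
    train(ParameterServer([0.0, 0.0]), shards, total_steps=57, ckpt_path=ckpt)
    resumed = ParameterServer.restore(ckpt)
    print("resuming from step", resumed.step)
    train(resumed, shards, total_steps=200, ckpt_path=ckpt)
    print("weights after 200 steps:", [round(v, 3) for v in resumed.weights])
```

Because the update is synchronous and deterministic, the resumed run should end with the same weights as an uninterrupted one, approaching w = 2 and b = 1. A real deployment would run workers as separate processes or containers and would also have to deal with stragglers and asynchronous updates, which this sketch ignores.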
You’ll learn how to use the converged data platform as the data infrastructure, providing a distributed filesystem, key-value storage, and streams on which to store data and build the data pipeline. With deep learning libraries like TensorFlow or Apache MXNet housed in persistent application client containers (PACC), you can persist the model to the distributed filesystem, give DL frameworks full access to the vast data on the distributed filesystem, and serve models to score the data coming in through streams. Furthermore, you can manage model versions and library dependencies through container images and customize the machine learning server for production.
Dong Meng is a data scientist at MapR, where he helps customers solve their business problems with big data by turning their data into actionable insights or machine learning products. His recent work includes integrating open source machine learning frameworks like PredictionIO and XGBoost with MapR’s platform. He also created the time series QSS and deep learning QSS as MapR service offerings. Dong has several years of experience in statistical machine learning, data mining, and big data product development. Previously, he was a senior data scientist with ADP, where he built machine learning pipelines and data products for HR using payroll data to power ADP Analytics, and a staff software engineer with IBM SPSS, where he was part of the team that built Watson Analytics. During his graduate study at the Ohio State University, Dong served as a research assistant, concentrating on compressive sensing and solving point estimation problems from a Bayesian perspective.
©2018, O'Reilly Media, Inc.