One of the major challenges that data scientists often face is that once they have trained the model, they need to deploy it at production scale. It’s widely accepted that GPUs should be used for deep learning training, due to their significant speed when compared to CPUs. However, for tasks like inference (which are not as resource heavy as training), CPUs are usually sufficient and are more attractive due to their lower cost. But when inference speed is a bottleneck, GPUs provide considerable gains both from financial and time perspectives. Coupled with containerized applications and container orchestrators like Kubernetes, it’s now possible to go from training to deployment with GPUs faster and more easily while satisfying latency and throughput goals for production grade deployments.
Mathew Salvaris and Fidan Boylu Uz offer a step-by-step guide to creating a pretrained deep learning model, packaging it in a Docker container, and deploying as a web service on a Kubernetes cluster. You’ll learn how to test and verify each step and discover the gotchas you may encounter. You’ll also explore a demo of how to make calls to the deployed service to score images on a predeployed Kubernetes cluster as well as benchmarking results that provide a rough gauge of the performance of deep learning models on GPU and CPU clusters.
The tests use two frameworks—TensorFlow (1.8) and Keras (2.1.6) with a TensorFlow (1.6) backend—for five different models:
These models were selected in order to test a wide range of networks, from small parameter efficient models such as MobileNet to large networks such as NasNetLarge. For each, a Docker image with an API for scoring images has been prepared and deployed on four different cluster configurations:
Overall, results show that the throughput scales almost linearly with the number of GPUs and that GPUs always outperform CPUs at a similar price point. Mathew and Fidan also found that the performance on GPU clusters were far more consistent than CPUs—possibly because there’s no contention for resources between the model and the web service that’s present in the CPU only deployment. These results suggest that for deep learning inference tasks that use models with high number of parameters, GPU-based deployments benefit from the lack of resource contention and provide significantly higher throughput values compared to a CPU cluster of similar cost.
The session uses notebooks that you return to later.
Mathew Salvaris is a senior data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers; a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will and where he devised models to decode human decisions in real time from the motor cortex using electroencephalography (EEG); and he held a postdoctoral position at the University of Essex’s Brain Computer Interface group and was a visiting researcher at Caltech. Mathew holds a PhD in brain-computer interfaces and an MSc in distributed artificial intelligence.
Fidan Boylu is a senior data scientist at Microsoft, where she’s responsible for successful delivery of end-to-end advanced analytic solutions. She’s also worked on a number of projects on predictive maintenance and fraud detection. Fidan has 10+ years of technical experience on data mining and business intelligence. Previously, she was a professor conducting research and teaching courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications and holds a PhD in decision sciences.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com