Presented By
O’Reilly + Intel AI
Put AI to Work
April 15-18, 2019
New York, NY

Deploying Deep Learning Models on GPU Enabled Kubernetes Clusters

Mathew Salvaris (Microsoft), Fidan Boylu Uz (Microsoft)
11:05am–11:45am Wednesday, April 17, 2019
Implementing AI
Location: Rendezvous
Secondary topics:  Deep Learning and Machine Learning tools, Edge computing and Hardware, Platforms and infrastructure

Who is this presentation for?

Data Scientists, Solution Architects, Deep Learning Practitioners



Prerequisite knowledge

Familiarity with Deep Learning, Python and Kubernetes.

What you'll learn

• Learn how to deploy deep learning models on Kubernetes.
• Understand how to create the web service in a Docker container and test it locally.
• Learn how to create a Kubernetes cluster with GPUs and deploy the web service to it.
• Discuss best practices for testing the model and obtaining throughput metrics.
• Discuss GPU vs. CPU benchmarking results that can serve as a rough guide for estimating the performance of deployed models.


One of the major challenges data scientists face is that, once they have trained a model, they need to deploy it at production scale. It is widely accepted that deep learning training should run on GPUs because they are significantly faster than CPUs. Inference, however, is less resource-intensive than training, so CPUs are usually believed to be sufficient and more attractive because of their lower cost. Yet when inference speed is a bottleneck, using GPUs provides considerable gains in both cost and time. Coupled with containerized applications and container orchestrators like Kubernetes, it is now possible to go from training to deployment on GPUs faster and more easily while meeting the latency and throughput goals of production-grade deployments.
In this session, we will present the steps necessary to go from a trained deep learning model to verifying it, packaging it in a Docker container, and deploying it on a Kubernetes cluster with GPUs. We will give advice on how to test and verify each step and go over possible gotchas. We will refer to the notebooks in the accompanying repository, which the audience can use later if needed. We will also demo how to make calls to the deployed service to score images on a pre-deployed Kubernetes cluster.
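As a rough illustration of what a call to such a scoring service might look like, here is a minimal Python sketch using only the standard library. The endpoint URL, the JSON layout of the request, and the function names are assumptions made for this example; they are not taken from the session's notebooks, and a real service may expect a different payload.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint; substitute the address of your own deployed service.
SCORING_URL = "http://localhost:5000/score"


def build_scoring_payload(image_bytes: bytes) -> bytes:
    """Wrap raw image bytes in a JSON body, base64-encoded so the body is text-safe.

    The {"input": {"image": ...}} layout is an assumption for illustration.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"input": {"image": encoded}}).encode("utf-8")


def score_image(image_bytes: bytes, url: str = SCORING_URL, timeout: float = 10.0) -> dict:
    """POST one image to the service and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=build_scoring_payload(image_bytes),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Running the same client first against the container on a local machine and then against the cluster address is one way to verify each step before moving on to the next.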
We will then present benchmarking results that can serve as a rough gauge of the performance of deep learning models on GPU and CPU clusters. In our tests, we use two frameworks, TensorFlow (1.8) and Keras (2.1.6) with a TensorFlow (1.6) backend, for five different models, listed here from smallest to largest network size:
- MobileNetV2 (3.4M parameters)
- NasNetMobile (4.2M parameters)
- ResNet50 (23.5M parameters)
- ResNet152 (58.1M parameters)
- NasNetLarge (84.7M parameters)
We selected these models because we wanted to test a wide range of networks, from small, parameter-efficient models such as MobileNet to large networks such as NasNetLarge. For each of these models, a Docker image with an API for scoring images was prepared and deployed on four different cluster configurations:
- 1 node GPU cluster with 1 pod
- 2 node GPU cluster with 2 pods
- 3 node GPU cluster with 3 pods
- 5 node CPU cluster with 35 pods
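One way to obtain throughput numbers for configurations like these is to send many requests concurrently and divide the request count by the elapsed wall-clock time. The sketch below is a minimal harness under that assumption; it is not the benchmarking code used for the session. `fake_score` is a hypothetical stand-in that only sleeps, and in practice it would be replaced by a real call to the deployed service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(score_fn: Callable[[], object], num_requests: int, concurrency: int) -> dict:
    """Call score_fn num_requests times from a thread pool and summarize the timings."""

    def timed_call(_):
        # Time a single request from start to finish.
        start = time.perf_counter()
        score_fn()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_requests)))
    wall_time = time.perf_counter() - wall_start

    return {
        "requests": num_requests,
        "throughput_per_sec": num_requests / wall_time,
        # Middle element of the sorted latencies (the upper median for even counts).
        "median_latency_sec": latencies[len(latencies) // 2],
        "max_latency_sec": latencies[-1],
    }


def fake_score() -> None:
    """Hypothetical stand-in for one scoring request; it only sleeps briefly."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(measure_throughput(fake_score, num_requests=50, concurrency=5))
```

Reporting the spread of latencies alongside throughput makes it easier to compare how consistent each cluster configuration is, not just how fast.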

Overall, we found that throughput scales almost linearly with the number of GPUs and that GPUs always outperform CPUs at a similar price point. We also found that performance on GPU clusters was far more consistent than on the CPU cluster. We hypothesize that this is because the GPU deployment avoids the contention for resources between the model and the web service that is present in the CPU-only deployment. We conclude that, for deep learning inference tasks using models with a high number of parameters, GPU-based deployments benefit from the lack of resource contention and provide significantly higher throughput than a CPU cluster of similar cost.
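The two comparisons above, scaling with the number of GPUs and throughput at a similar price point, can each be expressed as a simple ratio. The helper functions below are an illustrative sketch; the function names are assumptions, and the numbers in the usage example are made up for demonstration and are not results from the benchmark.

```python
def scaling_efficiency(throughput_single: float, throughput_n: float, n: int) -> float:
    """Fraction of ideal linear scaling achieved on n nodes (1.0 means perfectly linear)."""
    if n <= 0 or throughput_single <= 0:
        raise ValueError("n and throughput_single must be positive")
    return throughput_n / (n * throughput_single)


def requests_per_dollar(throughput_per_sec: float, cluster_cost_per_hour: float) -> float:
    """Requests served per unit of currency, given the cluster's hourly cost."""
    if cluster_cost_per_hour <= 0:
        raise ValueError("cluster_cost_per_hour must be positive")
    return throughput_per_sec * 3600 / cluster_cost_per_hour


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(scaling_efficiency(throughput_single=10.0, throughput_n=28.5, n=3))
    print(requests_per_dollar(throughput_per_sec=10.0, cluster_cost_per_hour=2.0))
```

Normalizing throughput by cost is what allows clusters with different node counts and hardware to be compared on equal terms.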

Mathew Salvaris


Mathew Salvaris is a data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers and a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will, devising models to decode human decisions in real time from the motor cortex using electroencephalography (EEG). He also held a postdoctoral position in the University of Essex’s Brain Computer Interface Group, where he worked on BCIs for computer mouse control. Mathew holds a PhD in brain computer interfaces and an MSc in distributed artificial intelligence.

Fidan Boylu Uz


Fidan Boylu has 10+ years of technical experience in data mining and business intelligence. She holds a PhD in decision sciences and is a former professor who conducted research and taught courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications. She currently works as a senior data scientist at Microsoft, where she is responsible for the successful delivery of end-to-end advanced analytics solutions. She has worked on a number of projects in predictive maintenance, fraud detection, and computer vision.
