Put AI to Work
April 15-18, 2019
New York, NY

Deploying deep learning models on GPU-enabled Kubernetes clusters

Mathew Salvaris (Microsoft), Fidan Boylu Uz (Microsoft)
11:05am–11:45am Wednesday, April 17, 2019
Implementing AI
Location: Trianon Ballroom
Secondary topics: Deep Learning and Machine Learning tools, Edge computing and Hardware, Platforms and infrastructure

Who is this presentation for?

  • Data scientists, solution architects, and deep learning practitioners



Prerequisite knowledge

  • Familiarity with deep learning, Python, and Kubernetes

What you'll learn

  • Learn how to deploy deep learning models on Kubernetes: create a web service with Docker and test it locally, then create a Kubernetes cluster with GPUs and deploy the web service to it
  • Explore best practices for testing the model and obtaining throughput metrics
  • Understand GPU versus CPU benchmarking results that can serve as a rough guide to estimating the performance of deployed models
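As a rough illustration of what obtaining throughput metrics can look like, here is a minimal Python sketch. It is not the presenters' test harness: `measure_throughput` and `fake_score` are hypothetical names, and `fake_score` merely sleeps in place of a real call to a deployed scoring endpoint. It sends requests one at a time, whereas a realistic load test would typically issue them concurrently.

```python
import time
from statistics import mean, median


def measure_throughput(score_fn, payloads):
    """Call score_fn once per payload and return throughput and latency stats.

    Assumes payloads is non-empty; requests are sent sequentially.
    """
    latencies = []
    start = time.perf_counter()
    for payload in payloads:
        t0 = time.perf_counter()
        score_fn(payload)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "throughput_per_s": len(latencies) / elapsed,
        "mean_latency_s": mean(latencies),
        "median_latency_s": median(latencies),
    }


def fake_score(payload):
    # Hypothetical stand-in for a request to a deployed scoring service.
    time.sleep(0.001)
    return {"label": "placeholder"}


stats = measure_throughput(fake_score, [b"image-bytes"] * 20)
print(stats)
```

Swapping `fake_score` for a function that posts an image to the real service would give per-request latency and overall requests per second for that deployment.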


One of the major challenges data scientists face is that once they have trained a model, they need to deploy it at production scale. It’s widely accepted that GPUs should be used for deep learning training because they are significantly faster than CPUs. For inference, which is less resource heavy than training, CPUs are usually sufficient and are more attractive due to their lower cost. But when inference speed is a bottleneck, GPUs provide considerable gains from both financial and time perspectives. Coupled with containerized applications and container orchestrators like Kubernetes, it’s now possible to go from training to deployment with GPUs faster and more easily while satisfying latency and throughput goals for production-grade deployments.

Mathew Salvaris and Fidan Boylu Uz offer a step-by-step guide to taking a pretrained deep learning model, packaging it in a Docker container, and deploying it as a web service on a Kubernetes cluster. You’ll learn how to test and verify each step and discover the gotchas you may encounter. You’ll also see a demo of how to call the deployed service to score images on a predeployed Kubernetes cluster, along with benchmarking results that provide a rough gauge of the performance of deep learning models on GPU and CPU clusters.
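To make the shape of such a scoring web service concrete, the following is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the service built in the session: the `/score` route, the `predict` stub (which returns a placeholder instead of running a model), and the `score_remote` helper are all hypothetical. It starts the service on a free local port, scores one payload, and shuts down, which mirrors the "test it locally" step before containerizing.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(image_bytes):
    # Hypothetical stub: a real service would run the pretrained model here.
    return {"label": "placeholder", "num_bytes": len(image_bytes)}


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/score":  # "/score" is an assumed route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.dumps(predict(self.rfile.read(length))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the example quiet


def score_remote(url, image_bytes):
    """POST raw image bytes to the service and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=image_bytes,  # supplying data makes this a POST request
        headers={"Content-Type": "application/octet-stream"},
    )
    # ProxyHandler({}) ignores proxy environment variables so the call goes direct.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


# Local test: bind to a free port, score one payload, then shut down.
server = HTTPServer(("127.0.0.1", 0), ScoreHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = score_remote(f"http://127.0.0.1:{server.server_port}/score", b"\x00" * 16)
server.shutdown()
server.server_close()
print(result)  # {'label': 'placeholder', 'num_bytes': 16}
```

The same `score_remote` call could be pointed at the address of a deployed service once the container is running on a cluster.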

The tests use two frameworks, TensorFlow (1.8) and Keras (2.1.6) with a TensorFlow (1.6) backend, for five different models:

  • MobileNetV2 (3.4M parameters)
  • NasNetMobile (4.2M parameters)
  • ResNet50 (23.5M parameters)
  • ResNet152 (58.1M parameters)
  • NasNetLarge (84.7M parameters)

These models were selected to cover a wide range of networks, from small, parameter-efficient models such as MobileNetV2 to large networks such as NasNetLarge. For each, a Docker image with an API for scoring images was prepared and deployed on four different cluster configurations:

  • 1-node GPU cluster with 1 pod
  • 2-node GPU cluster with 2 pods
  • 3-node GPU cluster with 3 pods
  • 5-node CPU cluster with 35 pods
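For readers unfamiliar with how a pod requests a GPU, here is a hedged Python sketch that builds a Kubernetes Deployment manifest as a dictionary and prints it as JSON (which Kubernetes accepts in place of YAML). Several details are assumptions not stated in the session description: the `nvidia.com/gpu` resource name presumes the NVIDIA device plugin is installed on the cluster, one GPU per pod is a guess based on the one-pod-per-node configurations above, and the deployment name, image name, and container port are hypothetical.

```python
import json


def gpu_deployment(name, image, replicas, gpus_per_pod=1):
    """Build a Kubernetes Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # number of pods to run
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 80}],  # assumed port
                        # Assumes the NVIDIA device plugin exposes this resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }]
                },
            },
        },
    }


# Hypothetical names; 3 replicas corresponds to the 3-pod GPU configuration.
manifest = gpu_deployment(
    "dl-scoring", "example.registry.io/dl-scoring:latest", replicas=3
)
print(json.dumps(manifest, indent=2))
```

A CPU-only deployment would omit the GPU limit and raise `replicas` instead, as in the 35-pod configuration.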

Overall, the results show that throughput scales almost linearly with the number of GPUs and that, in these tests, GPUs always outperformed CPUs at a similar price point. Mathew and Fidan also found that performance on GPU clusters was far more consistent than on CPU clusters, possibly because the GPU deployments avoid the contention for resources between the model and the web service that is present in the CPU-only deployment. These results suggest that for deep learning inference tasks that use models with a high number of parameters, GPU-based deployments benefit from the lack of resource contention and provide significantly higher throughput than a CPU cluster of similar cost.
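One way to quantify "almost linearly" is a scaling efficiency: measured throughput divided by what perfect linear scaling from the smallest cluster would predict. The sketch below is an assumption about how one might compute it, not the presenters' method, and the throughput figures are made-up placeholders rather than results from the session.

```python
def scaling_efficiency(throughput_by_nodes):
    """Return {nodes: efficiency} relative to the smallest cluster measured.

    An efficiency of 1.0 means perfectly linear scaling.
    """
    base_nodes = min(throughput_by_nodes)
    base_rate = throughput_by_nodes[base_nodes] / base_nodes  # per-node rate
    return {
        nodes: throughput / (nodes * base_rate)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Made-up numbers for illustration only; NOT results from the session.
example = {1: 10.0, 2: 19.0, 3: 27.0}
efficiency = scaling_efficiency(example)
for nodes, value in efficiency.items():
    print(nodes, round(value, 2))
```

Values close to 1.0 across the 1-, 2-, and 3-node GPU clusters would correspond to the near-linear scaling described above.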

The session uses notebooks that you can return to later.

Mathew Salvaris


Mathew Salvaris is a senior data scientist at Microsoft. Previously, Mathew was a data scientist for a small startup that provided analytics for fund managers; a postdoctoral researcher at UCL’s Institute of Cognitive Neuroscience, where he worked with Patrick Haggard in the area of volition and free will and devised models to decode human decisions in real time from the motor cortex using electroencephalography (EEG); a postdoctoral researcher in the University of Essex’s Brain Computer Interface group; and a visiting researcher at Caltech. Mathew holds a PhD in brain-computer interfaces and an MSc in distributed artificial intelligence.

Fidan Boylu Uz


Fidan Boylu Uz is a senior data scientist at Microsoft, where she’s responsible for the successful delivery of end-to-end advanced analytic solutions. She’s also worked on a number of projects in predictive maintenance and fraud detection. Fidan has 10+ years of technical experience in data mining and business intelligence. Previously, she was a professor conducting research and teaching courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications and holds a PhD in decision sciences.