Put AI to work
June 26-27, 2017: Training
June 27-29, 2017: Tutorials & Conference
New York, NY

Benchmarking deep learning inference

Sharan Narang (Baidu)
4:50pm-5:30pm Wednesday, June 28, 2017
Implementing AI
Location: Grand Ballroom West
Level: Beginner

Prerequisite Knowledge

  • A basic understanding of deep learning algorithms, such as those for convolutional and recurrent neural networks

What you'll learn

  • Learn about the workloads for deep learning inference and the techniques used to speed it up


Artificial intelligence, particularly deep learning, has revolutionized many different applications over the past few years. The research community and industry are racing to advance the field and realize real-world impact. To help advance deep learning, Baidu released the open source benchmarking tool DeepBench in 2016, which measures the performance of deep learning training operations on different hardware.

However, the performance characteristics of inference differ significantly from those of training. To broaden the impact of deep learning, it is important to speed up inference for deep learning algorithms: improvements in inference time can significantly improve the user experience of applications that use deep learning.

Sharan Narang outlines the challenges of inference for deep learning models and the different workloads and performance requirements of various applications. Along the way, Sharan discusses the key differences between inference and training and the various techniques used to speed up deep learning inference.
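To give a flavor of two such techniques, weight pruning and reduced precision (both of which Sharan has explored in his research), here is an illustrative NumPy sketch. It is a toy on a randomly generated weight matrix, not code from DeepBench or the talk:

```python
import numpy as np

# Hypothetical weight matrix standing in for one layer of a trained network.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

def prune_by_magnitude(w, sparsity):
    """Magnitude-based pruning: zero out the smallest-magnitude weights
    so that roughly `sparsity` fraction of the entries become zero."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

pruned = prune_by_magnitude(weights, sparsity=0.9)
print(f"zero fraction: {np.mean(pruned == 0):.2f}")

# Reduced precision: storing weights in float16 halves memory traffic,
# and inference kernels can exploit faster half-precision arithmetic.
half = pruned.astype(np.float16)
print(f"size ratio vs. float32: {half.nbytes / weights.nbytes}")
```

In practice, pruning produces sparse matrices that need sparse kernels (or structured sparsity) to yield real speedups, and reduced precision requires hardware support for half-precision arithmetic; both trade a small amount of model accuracy for lower memory and compute cost at inference time.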


Sharan Narang


Sharan Narang is a senior researcher on the systems team at Baidu’s Silicon Valley AI Lab (SVAIL), where he leads the effort to benchmark deep learning applications. In 2016, he released DeepBench, an open source benchmark that measures the performance of deep learning workloads. Sharan also focuses on research to improve the performance of deep learning models by reducing their memory and compute requirements, exploring techniques such as pruning neural network weights and reduced precision. Previously, Sharan worked on next-generation mobile processors at NVIDIA.