September 26-27, 2016
New York, NY

Deep neural network model compression and an efficient inference engine

Song Han (Stanford University)
11:00am–11:40am Monday, 09/26/2016
Location: River Pavilion B
Average rating: ***** (5.00, 5 ratings)

What you'll learn

  • Learn how deep compression aids in the deployment of neural networks by reducing the storage requirement without affecting accuracy
  • Explore an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication

Description

Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Song Han explains how deep compression addresses this limitation by reducing the storage requirement of neural networks without affecting their accuracy. (On the ImageNet dataset, this method reduced the storage required by AlexNet by 35x, from 240 MB to 6.9 MB, and by VGG-16 by 49x, from 552 MB to 11.3 MB, with no loss of accuracy in either case.) Deep compression also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained, and it allows the model to fit in on-chip SRAM cache rather than off-chip DRAM.
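As a rough sketch of the pruning and weight-sharing ideas behind deep compression (not Han's actual pipeline, which also retrains the network between stages and Huffman-codes the result), the NumPy example below zeroes out small-magnitude weights and then maps the survivors onto a small shared codebook. The function names, the 90% sparsity target, and the 16-entry codebook are illustrative assumptions.

    import numpy as np

    def prune_by_magnitude(weights, sparsity=0.9):
        """Zero out the smallest-magnitude weights until the target sparsity is reached."""
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) > threshold
        return weights * mask, mask

    def quantize_with_codebook(weights, n_clusters=16):
        """Map each surviving weight to the nearest of n_clusters shared values
        (4-bit indices when n_clusters == 16), refined with a few Lloyd iterations."""
        nonzero = weights[weights != 0]
        centroids = np.linspace(nonzero.min(), nonzero.max(), n_clusters)
        for _ in range(10):
            assignment = np.argmin(np.abs(nonzero[:, None] - centroids[None, :]), axis=1)
            for k in range(n_clusters):
                members = nonzero[assignment == k]
                if members.size:
                    centroids[k] = members.mean()
        quantized = weights.copy()
        assignment = np.argmin(np.abs(nonzero[:, None] - centroids[None, :]), axis=1)
        quantized[weights != 0] = centroids[assignment]
        return quantized, centroids

    # Illustrative use on a random fully connected layer.
    W = np.random.randn(256, 512).astype(np.float32)
    W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
    W_quantized, codebook = quantize_with_codebook(W_pruned, n_clusters=16)
    print(f"weights kept: {mask.mean():.1%}")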

Song also proposes an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication. Evaluated on nine DNN benchmarks against CPU and GPU implementations of the same uncompressed DNNs, EIE is 189x faster than the CPU and 13x faster than the GPU. With a processing power of 102 GOPS at only 600 mW, EIE is also 24,000x more energy efficient than the CPU and 3,000x more energy efficient than the GPU.
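To make the computation EIE accelerates concrete, here is a minimal NumPy sketch of sparse matrix-vector multiplication over a column-compressed, weight-shared layer that skips zero input activations. The storage format and function names are assumptions for illustration only; the actual EIE hardware uses a relative-indexed encoding of the 4-bit weight indices and distributes the work across many parallel processing elements.

    import numpy as np

    def compress_csc(W, codebook):
        """Store W column by column: nonzero row indices plus codebook indices."""
        col_ptr, row_idx, weight_idx = [0], [], []
        for j in range(W.shape[1]):
            rows = np.flatnonzero(W[:, j])
            row_idx.extend(rows.tolist())
            nearest = np.argmin(np.abs(W[rows, j][:, None] - codebook[None, :]), axis=1)
            weight_idx.extend(nearest.tolist())
            col_ptr.append(len(row_idx))
        return np.array(col_ptr), np.array(row_idx), np.array(weight_idx)

    def sparse_matvec(col_ptr, row_idx, weight_idx, codebook, x, n_rows):
        """Compute y ~= W @ x with codebook-quantized weights, skipping every
        column whose input activation is zero (dynamic activation sparsity)."""
        y = np.zeros(n_rows)
        for j in np.flatnonzero(x):
            start, end = col_ptr[j], col_ptr[j + 1]
            y[row_idx[start:end]] += codebook[weight_idx[start:end]] * x[j]
        return y

    # Illustrative use on a tiny pruned layer.
    rng = np.random.default_rng(0)
    W = np.where(rng.random((4, 6)) < 0.3, rng.standard_normal((4, 6)), 0.0)
    codebook = np.linspace(-2.0, 2.0, 16)
    col_ptr, row_idx, weight_idx = compress_csc(W, codebook)
    x = np.array([1.0, 0.0, 0.5, 0.0, 2.0, 0.0])
    y = sparse_matvec(col_ptr, row_idx, weight_idx, codebook, x, n_rows=4)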

Song Han

Stanford University

Song Han is a rising fifth-year PhD student at Stanford University advised by Bill Dally. Song’s research interests are deep learning and computer architecture; he is currently focused on improving the accuracy and efficiency of neural networks on mobile and embedded systems. Song has worked on deep compression, which can compress state-of-the-art CNNs by 10x–49x and compress SqueezeNet to only 470 KB, small enough to fit fully in on-chip SRAM. He proposed the DSD training flow, which improved the accuracy of a wide range of neural networks, and designed the EIE accelerator, an ASIC that works on the compressed model and is 13x faster and 3,000x more energy efficient than a Titan X GPU. Song’s work has been covered by The Next Platform, TechEmergence, Embedded Vision, and O’Reilly. His work on deep compression won the best paper award at ICLR ’16.

Comments

Robert McCarter
09/29/2016 5:55am EDT

How did you decide which edges to prune?