Presented By O’Reilly and Intel Nervana
Put AI to work
September 17-18, 2017: Training
September 18-20, 2017: Tutorials & Conference
San Francisco, CA

Highly dense modular acceleration clusters for deep learning

Bharadwaj Pudipeddi (NVXL Technology)
4:50pm–5:30pm Tuesday, September 19, 2017
Implementing AI
Location: Yosemite BC
Level: Intermediate
Secondary topics: Deep learning, Infrastructure

Prerequisite Knowledge

  • A basic understanding of deep learning and statistical machine learning

What you'll learn

  • Explore a new FPGA-based technology for deep learning
  • Learn how clustering accelerators can produce a whole greater than the sum of the parts


Specialized computing is increasingly popular for data center workloads that need high computational and I/O performance. This is particularly true of deep learning and machine learning, whose math libraries and primitives are normally offloaded to GPUs, FPGAs, and sometimes even ASICs.

However, most accelerators available today do not scale on their own. They are natively isolated and rely on higher-level frameworks (such as Apache Spark) running on CPUs for scalability. The few accelerators that can be natively clustered (i.e., interconnected by a fabric without CPU intervention) scale topologically only to a limited degree. For instance, NVIDIA’s DGX-1 meshes up to 8 GPUs together with NVLink, a high-speed interconnect fabric. Even so, the DGX-1 is a developer’s solution that often requires careful partitioning and manual tuning to fully exploit its clustering performance.

Bharadwaj Pudipeddi proposes a highly dense modular acceleration cluster, completely disaggregated from generic servers in the data center, that specifically targets deep learning and AI-related workloads. The cluster is scalable and lightweight (and devoid of Xeons), and it can run very deep neural networks through data and model parallelism for extreme performance. A low-level fabric minimizes data movement and supports scalability, resilience, and reconfigurability. The software (or middleware) for accelerating a wide range of workloads is designed to seamlessly support multiple frameworks, including Caffe and TensorFlow, as well as execution frameworks such as Apache Spark.
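
For readers unfamiliar with the two kinds of parallelism named above, the following Python sketch may help. It is a toy illustration only and does not represent NVXL's design, which this abstract does not describe. The "devices" are plain Python lists, and every function name here (forward, data_parallel, model_parallel) is hypothetical. Data parallelism gives each device the full set of weights and a slice of the batch. Model parallelism gives each device a slice of the weights and the whole batch. Both should reproduce the single-device result.

```python
# Toy sketch: "devices" are simulated with plain lists, not real accelerators.

def matvec(weights, x):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(weights, batch):
    """Apply a single linear layer to every sample in the batch."""
    return [matvec(weights, x) for x in batch]

def data_parallel(weights, batch, num_devices):
    """Each device holds all the weights and processes a slice of the batch."""
    shards = [batch[d::num_devices] for d in range(num_devices)]
    partial = [forward(weights, shard) for shard in shards]
    # Put the per-device outputs back into the original sample order.
    out = [None] * len(batch)
    for d, results in enumerate(partial):
        for i, y in enumerate(results):
            out[d + i * num_devices] = y
    return out

def model_parallel(weights, batch, num_devices):
    """Each device holds a slice of the weight rows and sees the whole batch."""
    row_shards = [weights[d::num_devices] for d in range(num_devices)]
    partial = [forward(shard, batch) for shard in row_shards]
    # Stitch each sample's output vector together from the per-device pieces.
    out = []
    for s in range(len(batch)):
        y = [None] * len(weights)
        for d in range(num_devices):
            for i, value in enumerate(partial[d][s]):
                y[d + i * num_devices] = value
        out.append(y)
    return out

weights = [[1, 2], [3, 4], [5, 6]]          # 3 outputs, 2 inputs
batch = [[1, 0], [0, 1], [1, 1], [2, 3]]    # 4 samples
reference = forward(weights, batch)
assert data_parallel(weights, batch, 2) == reference
assert model_parallel(weights, batch, 2) == reference
```

In a real cluster, the stitching steps in this sketch correspond to traffic over the interconnect, which is why the speed and topology of the fabric matter so much for scaling.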

Bharadwaj demonstrates how this modular approach accelerates the most demanding applications (including training) and how the architecture suits extremely deep neural networks by virtue of avoiding the unnecessary synchronization and centralized control often found in a traditional CPU-controlled server solution.

Bharadwaj Pudipeddi

NVXL Technology

Bharadwaj Pudipeddi is the cofounder and CTO of NVXL, a company building a new clustered acceleration platform for deep learning, machine learning, and SQL workloads. A product entrepreneur and hardware architect, Bharadwaj previously worked at Intel and a number of startups in the areas of CPU design, high-performance fabrics, flash memory storage, and scalable computing.