Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

High-efficiency systems for distributed AI and machine learning at scale

Qirong Ho (Petuum, Inc.)
1:45pm–2:25pm Thursday, December 8, 2016
Chat, machine learning, and AI
Location: Summit 1 Level: Intermediate

Prerequisite Knowledge

  • Prior experience with at least one AI or ML algorithm, especially on a distributed system such as Spark or Hadoop

What you'll learn

  • Understand how new AI and ML distributed systems can dramatically lower computational requirements for very large-scale AI and ML problems


The rise of big data has led to new demands for artificial intelligence (AI) and machine-learning (ML) systems to learn complex models with millions to billions of parameters, promising adequate capacity to digest massive datasets and offer powerful predictive analytics. In order to run AI and ML algorithms such as collaborative filtering, topic modeling, and deep learning at these scales, a distributed cluster with tens to hundreds of machines is almost always required. Because these algorithms impose high computational and communication costs, any effort to scale them out must address key engineering challenges like limited network bandwidth, uneven cluster performance, and the often-unnoticed yet crucial mathematical dependency structure hidden in these algorithms. Yet, at the same time, AI and ML algorithms present new opportunities to improve efficiency by 10x or more—for example, by exploiting their hill-climbing, error-tolerant nature to conserve network bandwidth or by performing rapid, fine-grained resource reallocation.
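The abstract's point about exploiting the hill-climbing, error-tolerant nature of ML to conserve network bandwidth is often realized through bounded-staleness synchronization, in which fast workers may run ahead of slow ones by a limited number of iterations rather than waiting at a barrier every step. The sketch below is a toy single-process illustration of that idea, not Petuum's actual API; the class and method names are invented for this example.

```python
import threading

class StaleSynchronousTable:
    """Toy bounded-staleness parameter table (illustrative only, not
    Petuum's API): workers read shared parameters that may be up to
    `staleness` clock ticks out of date, trading freshness for far
    less time spent waiting on stragglers."""

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.params = {}                  # shared model parameters
        self.cond = threading.Condition()

    def inc(self, key, delta):
        """Apply an additive update to a shared parameter."""
        with self.cond:
            self.params[key] = self.params.get(key, 0.0) + delta

    def get(self, key):
        """Read a (possibly stale) parameter value."""
        with self.cond:
            return self.params.get(key, 0.0)

    def clock(self, worker_id):
        """Advance this worker's clock; block only if it would get more
        than `staleness` ticks ahead of the slowest worker."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()        # a slow worker advancing may free others
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()
```

With `staleness = 0` this degenerates to bulk-synchronous execution (every worker waits every iteration); with larger values, fast workers keep computing on slightly stale parameters, which hill-climbing algorithms tolerate while converging.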

Qirong Ho introduces new systems for distributed AI and ML, centered around recent research and development efforts on industrial-scale solutions that scale to billions or trillions of data points and model parameters. These principles and strategies directly address the challenges and opportunities above and are grouped around four key questions: how to distribute an ML program over a cluster, how to bridge ML computation with intermachine communication, how to perform that communication, and what should be communicated between machines. By exploring these questions, Qirong outlines the characteristics unique to AI and ML programs that are not typically seen in traditional computer programs. He then dissects successful cases in which these principles were harnessed to design and develop high-performance distributed ML software, including the Petuum open source project for high-efficiency AI and ML on distributed clusters. The net effect is to lower both the capital cost (number of machines) and the operational cost (number of engineers and data scientists) of running AI and ML applications.
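One of the four questions, what should be communicated between machines, is commonly answered by sending only the most significant parameter changes each round and letting small residuals accumulate until they matter, again relying on the error tolerance of hill-climbing algorithms. The helper below is a hypothetical sketch of that filtering step, not a function from Petuum or the talk.

```python
import heapq

def topk_deltas(delta, k):
    """Sketch of bandwidth-conserving update filtering (hypothetical
    helper, not Petuum's API): keep only the k largest-magnitude
    parameter changes for transmission. In a real system the dropped
    deltas would accumulate locally and be sent once they grow large."""
    kept = heapq.nlargest(k, delta.items(), key=lambda kv: abs(kv[1]))
    return dict(kept)
```

For example, filtering `{"a": 0.9, "b": -1.5, "c": 0.01}` with `k = 2` transmits only `a` and `b`, cutting message size while perturbing the model negligibly.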


Qirong Ho

Petuum, Inc.

Qirong Ho is vice president of technology at Petuum, Inc., an adjunct assistant professor at the Singapore Management University School of Information Systems, and a former principal investigator at A*STAR’s Institute for Infocomm Research. Qirong’s research focuses on distributed cluster software systems for machine learning at big data and big model scales, with a view toward theoretical correctness and performance guarantees as well as practical needs like robustness, programmability, and usability. Qirong also works on statistical models for large-scale network analysis and social media, including latent space models for visualization, community detection, user personalization, and interest prediction. He is a recipient of the Singapore A*STAR National Science Search Undergraduate and PhD fellowships and the KDD 2015 Doctoral Dissertation Award (runner-up).