The rise of big data has led to new demands for artificial intelligence (AI) and machine-learning (ML) systems to learn complex models with millions to billions of parameters, promising adequate capacity to digest massive datasets and offer powerful predictive analytics. To run AI and ML algorithms such as collaborative filtering, topic modeling, and deep learning at these scales, a distributed cluster with tens to hundreds of machines is almost always required. Because these algorithms impose high computational and communication costs, any effort to scale them out must address key engineering challenges like limited network bandwidth, uneven cluster performance, and the often-unnoticed yet crucial mathematical dependency structure hidden in these algorithms. Yet, at the same time, AI and ML algorithms present new opportunities to improve efficiency by 10x or more: for example, by exploiting their hill-climbing, error-tolerant nature to conserve network bandwidth or by performing rapid, fine-grained resource reallocation.
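To make the bandwidth-conservation idea concrete, here is a minimal sketch (not from the talk itself) of one common pattern: a worker buffers its parameter updates locally and transmits only the entries whose accumulated magnitude crosses a significance threshold. Because iterative ML algorithms tolerate small, temporarily deferred errors, convergence is preserved while network traffic drops sharply. The function name and threshold value are illustrative assumptions, not Petuum APIs.

```python
import numpy as np

def significant_updates(buffered, threshold):
    """Return (indices, values) of buffered updates worth sending over
    the network; the sent entries are zeroed out of the local buffer,
    while insignificant updates stay buffered and accumulate."""
    idx = np.flatnonzero(np.abs(buffered) >= threshold)
    vals = buffered[idx].copy()
    buffered[idx] = 0.0  # deferred remainder accumulates for later rounds
    return idx, vals

# Usage: a 1,000-dimensional update vector where most entries are tiny noise
# and only a handful of coordinates carry genuinely large updates.
rng = np.random.default_rng(0)
buffer = rng.normal(scale=0.01, size=1000)
buffer[:5] += 1.0  # a few large, significant updates

idx, vals = significant_updates(buffer, threshold=0.1)
print(f"sent {idx.size} of 1000 entries")  # only the significant few
```

The design choice here mirrors the error-tolerance argument in the abstract: sending 5 values instead of 1,000 is a ~200x bandwidth saving on this round, and the small residuals are not lost, merely deferred.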
Qirong Ho introduces new systems for distributed AI and ML, centered around recent research and development efforts on industrial-scale solutions that scale to billions or trillions of data points and model parameters. These principles and strategies directly address the challenges and opportunities above and are organized around four key questions: how to distribute an ML program over a cluster; how to bridge ML computation with intermachine communication; how to perform such communication; and what should be communicated between machines. By exploring these questions, Qirong outlines the underlying characteristics unique to AI and ML programs but not typically seen in traditional computer programs. He then dissects successful cases in which these principles were harnessed to design and develop high-performance distributed ML software, including the Petuum open source project for high-efficiency AI and ML on distributed clusters. The net effect is to lower both capital costs (number of machines) and operational costs (number of engineers and data scientists) for running AI and ML applications.
Qirong Ho is vice president of technology at Petuum, Inc., an adjunct assistant professor at the Singapore Management University School of Information Systems, and a former principal investigator at A*STAR’s Institute for Infocomm Research. Qirong’s research focuses on distributed cluster software systems for machine learning at big data and big model scales, with a view toward theoretical correctness and performance guarantees as well as practical needs like robustness, programmability, and usability. Qirong also works on statistical models for large-scale network analysis and social media, including latent space models for visualization, community detection, user personalization, and interest prediction. He is a recipient of the Singapore A*STAR National Science Search Undergraduate and PhD fellowships and the KDD 2015 Doctoral Dissertation Award (runner-up).
©2016, O'Reilly Media, Inc.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.