Put open source to work
July 16–17, 2018: Training & Tutorials
July 18–19, 2018: Conference
Portland, OR

Distributed TensorFlow on Hops

Fabio Buso (Logical Clocks AB)
2:30pm–3:00pm Tuesday, July 17, 2018
Location: B115-116
Tags: tensorflow

Methods that scale with computation are the future of AI. Hyperscale AI companies produce the most accurate models and train them faster by using distributed deep learning.

Fabio Buso shares the latest developments in distributed TensorFlow and shows how distribution can both massively reduce training time and enable parallel experimentation for hyperparameter optimization. You’ll explore different distributed architectures for TensorFlow, including the parameter server and “ring allreduce” models, with a focus on open source TensorFlow frameworks that use Apache Spark to manage distributed training, such as Yahoo’s TensorFlowOnSpark, Uber’s Horovod, and the Hops model. Fabio also covers the programming models these frameworks support and highlights the importance of cluster support for managing GPUs as a resource.

To this end, he demonstrates how Hops, an open source distribution of Hadoop with support for GPUs as a resource, can run TensorFlow applications from a Jupyter notebook using Apache Spark for distribution. He then walks you through an end-to-end demo of distributed TensorFlow, from training to model deployment and inference with TensorFlow Serving, on a well-known large machine learning dataset (9M images, a 1 TB extended version of ImageNet). The demo covers how to debug, monitor, and visualize training with TensorBoard and how to deploy trained models on Kubernetes and use them for inference.
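To make the “ring allreduce” model concrete, here is a minimal single-process sketch in Python. It is an illustrative simulation, assuming each worker holds a gradient vector of the same length and that the goal is to sum those vectors across workers; it is not the API of Horovod, TensorFlowOnSpark, or Hops, and the function name `ring_allreduce` is hypothetical. Real frameworks exchange the chunks over the network, typically via libraries such as MPI or NCCL.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce over n workers arranged in a ring.

    Hypothetical helper for illustration only. Each worker's vector is split
    into n chunks. Phase 1 (scatter-reduce, n - 1 steps): worker i sends chunk
    (i - step) % n to worker (i + 1) % n, which adds it to its own copy; at the
    end, worker i holds the fully summed chunk (i + 1) % n. Phase 2 (allgather,
    n - 1 steps): the summed chunks travel around the ring and overwrite stale
    copies, so every worker ends with the full sum.
    """
    n = len(grads)
    length = len(grads[0])
    # Chunk c covers indices [bounds[c], bounds[c + 1]); chunks may be empty.
    bounds = [c * length // n for c in range(n + 1)]
    # Copy inputs so the caller's lists are not mutated.
    buffers = [list(g) for g in grads]

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    def ring_step(chunk_for_sender, accumulate: bool) -> None:
        # Snapshot all outgoing chunks first so the sends in one step
        # behave as if they happen simultaneously.
        outgoing = []
        for i in range(n):
            c = chunk_for_sender(i)
            outgoing.append((c, [buffers[i][k] for k in chunk(c)]))
        for i, (c, data) in enumerate(outgoing):
            dst = (i + 1) % n  # each worker only talks to its right neighbor
            for k, v in zip(chunk(c), data):
                if accumulate:
                    buffers[dst][k] += v
                else:
                    buffers[dst][k] = v

    # Phase 1: scatter-reduce.
    for step in range(n - 1):
        ring_step(lambda i, s=step: (i - s) % n, accumulate=True)

    # Phase 2: allgather of the fully reduced chunks.
    for step in range(n - 1):
        ring_step(lambda i, s=step: (i + 1 - s) % n, accumulate=False)

    return buffers


if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for w in ring_allreduce(workers):
        print(w)
```

Each worker sends 2(n − 1) chunks of roughly 1/n of the vector, so per-worker traffic stays close to twice the gradient size regardless of how many workers join. This is the usual argument for ring allreduce over a single parameter server, whose incoming bandwidth grows with the number of workers.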


Fabio Buso

Logical Clocks AB

Fabio Buso is the head of engineering at Logical Clocks AB, where he focuses on the machine learning service of the Hops Hadoop platform and leads the development of a scalable model serving infrastructure on Hops and Kubernetes. He is also involved in developing a feature store for machine learning on Hops, which is integrated with the TensorFlow framework. Fabio has an international background. He holds a master’s degree in cloud computing and services, with a focus on data-intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin. His master’s thesis at RISE SICS AB described his implementation of a strongly consistent metastore for Apache Hive on Hops.