Presented By O’Reilly and Intel AI
Put AI to work
8-9 Oct 2018: Training
9-11 Oct 2018: Tutorials & Conference
London, UK

TonY: Native support of TensorFlow on Hadoop

Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Anthony Hsu (LinkedIn)
11:05–11:45 Wednesday, 10 October 2018
Implementing AI
Location: Westminster Suite
Secondary topics: Deep Learning tools; Platforms and infrastructure

Who is this presentation for?

  • Data scientists and machine learning practitioners

Prerequisite knowledge

  • Basic knowledge of Hadoop and TensorFlow

What you'll learn

  • Explore TonY, a new solution that runs TensorFlow as a first-class object on Hadoop
  • Understand the pros and cons of different options on running TensorFlow on managed clusters

Description

Since TensorFlow’s initial release about two years ago, many machine learning practitioners have been training and serving TensorFlow models on standalone servers and on VMs in public clouds. As TensorFlow matures, there is a growing demand to run it on managed clusters (Kubernetes, Mesos, Hadoop, etc.) to unify the management and utilization of compute resources.

Jonathan Hung, Keqiu Hu, and Anthony Hsu offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. The core idea is to write a new application master for TensorFlow that natively negotiates resources with YARN (Hadoop’s compute management module) to run its workers and parameter servers. This native connector, coupled with several other TonY features, aims to run TensorFlow jobs as reliably and flexibly as other first-class objects on Hadoop, including MapReduce and Spark.

Distributed TensorFlow is composed of “tasks.” A task is either a parameter server (PS) task or a worker task. A PS task manages shared state that is updated by a set of parallel workers, while a worker task executes the actual training. PSs and workers find each other through a cluster spec that contains the host address and port number of every node; this spec is required before the actual TensorFlow job can start.
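
For illustration, a minimal cluster spec for one PS task and two worker tasks might look like the sketch below (host names and ports are made up; on a real cluster they are assigned when the containers start):

    import tensorflow as tf

    # Hypothetical host addresses and ports, for illustration only.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each task starts a server for its own job name and index, e.g. worker 0:
    server = tf.train.Server(cluster, job_name="worker", task_index=0)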

At a high level, TonY consists of three main components: the client, the application master, and the workers. The client submits the application to the Hadoop cluster; the application master negotiates resources with the cluster and controls the lifecycle of the application; and the workers run the tasks. TonY builds the distributed TensorFlow cluster spec through coordination between the TaskExecutors and the TonY application master (TonyAM). Once the TaskExecutor service starts inside a container, each TaskExecutor registers its IP address and a reserved port number with the TonyAM. After all TaskExecutors have reported their host information, the TonyAM broadcasts the complete cluster spec to every TaskExecutor, which then sets the cluster spec, job type, and job index as local environment variables and launches the task command.
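
As a rough sketch of what each task sees, the snippet below reads the cluster spec, job type, and job index from environment variables and starts the corresponding TensorFlow server. The variable names here are illustrative; the exact names TonY sets may differ.

    import json
    import os
    import tensorflow as tf

    # Illustrative environment variable names; TonY's actual names may differ.
    cluster = json.loads(os.environ["CLUSTER_SPEC"])  # {"ps": [...], "worker": [...]}
    job_type = os.environ["JOB_NAME"]                 # "ps" or "worker"
    task_index = int(os.environ["TASK_INDEX"])

    server = tf.train.Server(tf.train.ClusterSpec(cluster),
                             job_name=job_type,
                             task_index=task_index)

    if job_type == "ps":
        server.join()   # PS tasks block here and serve shared variables
    else:
        train(server)   # worker tasks run the user-defined training loop (not shown)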

Compared to existing solutions in the community, implementing distributed TensorFlow in this native fashion offers a few advantages. Since data storage and processing pipelines already reside in Hadoop at many large tech firms, adding TensorFlow to the mix lets companies consolidate computing resources and leverage existing tooling and knowledge. Building TensorFlow infrastructure directly on top of YARN enables fine-grained control over application behavior and provides GPU support and TensorFlow job tracking.

Jonathan, Keqiu, and Anthony walk you through TonY’s features, covering:

  • Azkaban job type: As part of TonY, they have implemented a TensorFlowJob job type for running TensorFlow jobs through Azkaban, LinkedIn’s open source workflow engine. Users specify the parameters and dependencies of their TensorFlow jobs in a .job file (see the sketch after this list). The same logic can be implemented for Apache Oozie, Airflow, and other workflow engines.
  • TensorBoard: TonY also enables each TensorFlow job to expose its TensorBoard on the application’s main tracking URL. This is done by starting TensorBoard on the chief task (the node with task_index 0) and having it register with the TonyAM, which updates the tracking URL to the TensorBoard address.
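
For illustration, an Azkaban .job file for such a TensorFlow job might look like the sketch below; the property names are hypothetical and only meant to convey the idea:

    type=TensorFlowJob
    # Hypothetical properties: upstream dependency, training script, and resource requests
    dependencies=prepare_data
    execute=src/mnist_distributed.py
    ps.instances=1
    worker.instances=4
    worker.gpus=1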

They conclude by exploring TonY’s roadmap, including better HDFS support for file formats like Avro and ORC and a job history server that reports status and diagnostic information for finished jobs, similar to MapReduce’s and Spark’s job history servers, and that organizes and launches TensorBoard for all finished jobs.


Jonathan Hung

LinkedIn

Jonathan Hung is a senior software engineer on the Hadoop development team at LinkedIn.


Keqiu Hu

LinkedIn

Keqiu Hu is a staff software engineer at LinkedIn, where he’s working on LinkedIn’s big data platforms, primarily focusing on TensorFlow and Hadoop.


Anthony Hsu

LinkedIn

Anthony Hsu is a staff software engineer on the Hadoop development team at LinkedIn, where he works on distributed TensorFlow infrastructure. Previously, he worked on Dali, LinkedIn’s dataset access layer, and Azkaban, LinkedIn’s workflow scheduler. He’s also contributed to Apache Hive and Pig.