Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

TonY: Native support of TensorFlow on Hadoop

Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
11:20am–12:00pm Thursday, 09/13/2018
Data engineering and architecture
Location: 1A 10 Level: Intermediate
Secondary topics:  Data Platforms, Deep Learning

Who is this presentation for?

  • Data scientists, data engineers, and machine learning practitioners

Prerequisite knowledge

  • Basic knowledge of Hadoop and TensorFlow

What you'll learn

  • Explore TonY, a new solution that runs TensorFlow as a first-class citizen on Hadoop
  • Understand the pros and cons of different options for running TensorFlow on managed clusters


Since TensorFlow’s initial release about two years ago, many machine learning practitioners have been training and serving TensorFlow models on standalone servers and VMs from public clouds. As TensorFlow matures, there is a pressing demand to run it on managed clusters (Kubernetes, Mesos, Hadoop, etc.) to unify the management and utilization of compute resources.

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. The core idea is to write a new application master for TensorFlow to natively negotiate resources with YARN (Hadoop’s compute management module) to run its workers and parameter servers. This native connector, coupled with several other TonY features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.

Distributed TensorFlow is composed of “tasks.” Each task is either a parameter server (PS) or a worker: a PS task manages shared state that a set of parallel workers update, and a worker task executes the actual training. PSs and workers find each other through a cluster spec, which lists the host address and port number of every task and must be known before the actual TensorFlow job starts.
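
A minimal sketch of the information a cluster spec carries (the hostnames are hypothetical; TensorFlow’s `tf.train.ClusterSpec` accepts the same job-name-to-address-list mapping):

```python
# Cluster spec: one address list per job type. In practice these
# addresses come from the resource manager, not a hardcoded dict.
cluster_spec = {
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

def task_address(spec, job_name, task_index):
    """Look up the host:port a given task should bind to."""
    return spec[job_name][task_index]

print(task_address(cluster_spec, "worker", 1))  # worker1.example.com:2222
```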

At a high level, TonY consists of three main components: the client, the application master (AM), and the task executors. The client submits the application to the Hadoop cluster; the application master negotiates resources with the cluster and controls the lifecycle of the application; and the task executors run the TensorFlow tasks. TonY builds the distributed TensorFlow cluster spec through a handshake between the TaskExecutors and the TonY AM. Once a TaskExecutor service starts inside its container, it registers its IP address and a reserved port number with the TonY AM. After every TaskExecutor has reported its host information, the TonY AM broadcasts the completed cluster spec to all TaskExecutors, and each TaskExecutor sets the cluster spec, job type, and job index as local environment variables before starting the task command.
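
The register-then-broadcast handshake can be sketched as a toy in-process model; class, method, and environment-variable names here are illustrative, not TonY’s actual API:

```python
import json

class ToyAM:
    """Toy application master: collects task registrations, then serves
    the completed cluster spec back to every task executor."""

    def __init__(self, expected):
        self.expected = expected                      # e.g. {"ps": 1, "worker": 2}
        self.registered = {name: [] for name in expected}

    def register(self, job_name, address):
        # A TaskExecutor reports the host:port it reserved; the AM
        # assigns it the next index within its job type.
        self.registered[job_name].append(address)
        return len(self.registered[job_name]) - 1

    def ready(self):
        # The cluster spec is complete once every expected task is in.
        return all(len(self.registered[j]) == n for j, n in self.expected.items())

    def cluster_spec(self):
        return self.registered

# Executors register in whatever order their containers come up:
am = ToyAM({"ps": 1, "worker": 2})
am.register("worker", "worker0.example.com:2222")
am.register("ps", "ps0.example.com:2222")
idx_w1 = am.register("worker", "worker1.example.com:2222")

# Once everyone has registered, each executor exports the spec plus its
# own identity before launching the task command (variable names are
# illustrative, not TonY's documented ones):
assert am.ready()
env_for_w1 = {
    "CLUSTER_SPEC": json.dumps(am.cluster_spec()),
    "JOB_NAME": "worker",
    "TASK_INDEX": str(idx_w1),
}
```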

Compared to existing solutions in the community, implementing distributed TensorFlow in this native fashion offers a few advantages. Since data storage and processing pipelines at many large tech firms already reside in Hadoop, adding TensorFlow to the mix lets companies consolidate computing resources and leverage existing tooling and knowledge. Building TensorFlow infrastructure directly on top of YARN enables flexible, fine-grained control over application behavior and offers GPU support and TensorFlow job tracking.

Jonathan, Keqiu, and Zhe walk you through TonY’s features, covering:

  • Azkaban job type: As part of TonY, they have implemented a TensorFlowJob job type to support running TensorFlow jobs through Azkaban, LinkedIn’s open source workflow engine. Users can easily specify the parameters and dependencies of their TensorFlow jobs in a .job file. The same logic can be implemented for Apache Oozie, Airflow, and other workflow engines.
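
A hedged sketch of what such a .job file might look like; the `type` value follows the job-type name above, but the remaining keys are illustrative rather than TonY’s documented configuration:

```
# Hypothetical Azkaban .job file for a TonY TensorFlow job.
# Key names below (other than "type") are illustrative only.
type=TensorFlowJob
ps.instances=2
worker.instances=4
worker.memory=8g
executes=train.py
```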
  • TensorBoard: TonY also enables each TensorFlow job to expose its TensorBoard on the job’s main tracking URL. This is done by starting TensorBoard on the chief task (the node with task_index 0) and having it register with the TonY AM, which updates the tracking URL to the TensorBoard address.
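
The chief-only launch can be sketched as follows, assuming hypothetical JOB_NAME and TASK_INDEX environment variables that identify the current task (these names are illustrative, not TonY’s documented ones):

```python
import os
import subprocess

def maybe_start_tensorboard(logdir):
    """Launch TensorBoard only on the chief task (worker with index 0).

    Returns the TensorBoard process on the chief, and None on every
    other task. The chief would then report its address to the AM so
    the YARN tracking URL points at TensorBoard.
    """
    is_chief = (os.environ.get("JOB_NAME") == "worker"
                and os.environ.get("TASK_INDEX") == "0")
    if not is_chief:
        return None
    return subprocess.Popen(["tensorboard", "--logdir", logdir,
                             "--port", "6006"])
```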

They conclude by exploring TonY’s roadmap, including better HDFS support for file formats like Avro and ORC, as well as a job history server that reports status and diagnostic information for finished jobs, similar to the MapReduce and Spark job history servers, and that organizes and launches TensorBoard for all finished jobs.


Jonathan Hung


Jonathan Hung is a senior software engineer on the Hadoop development team at LinkedIn.


Keqiu Hu


Keqiu Hu is a staff software engineer at LinkedIn, where he’s working on LinkedIn’s big data platforms, primarily focusing on TensorFlow and Hadoop.


Zhe Zhang


Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an excellent engineering team to provide big data services (Hadoop distributed file system (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe’s an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).