Since TensorFlow’s initial release about two years ago, many machine learning practitioners have been training and serving TensorFlow models on standalone servers and VMs from public clouds. As TensorFlow matures, there is growing demand to run it on managed clusters (Kubernetes, Mesos, Hadoop, etc.) to unify the management and utilization of compute resources.
Jonathan Hung, Keqiu Hu, and Anthony Hsu offer an overview of TensorFlow on YARN (TonY), a framework for running TensorFlow natively on Hadoop. The core idea is a new application master for TensorFlow that negotiates resources directly with YARN (Hadoop’s compute management module) to run TensorFlow’s workers and parameter servers. This native connector, coupled with several other TonY features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop, such as MapReduce and Spark.
Distributed TensorFlow is composed of “tasks.” A task is either a parameter server (PS) or a worker: a PS task manages shared state that is updated by a set of parallel workers, and a worker executes the actual training. PSs and workers discover each other through a cluster spec that contains the host address and port number of every node, which must be assembled before the actual TensorFlow job can start.
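As a minimal sketch of what such a cluster spec looks like, the snippet below builds the standard TF_CONFIG environment variable that distributed TensorFlow tasks read; the host names and port numbers are illustrative, not taken from any real deployment.

```python
import json
import os

# Illustrative hosts and ports: one parameter server and two workers.
cluster_spec = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# Each task receives the full cluster spec plus its own job type and
# index, here expressed via the TF_CONFIG convention that distributed
# TensorFlow reads from the environment.
tf_config = {
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

Every task gets the same `cluster` section; only the `task` section differs, telling each process which slot in the spec it occupies.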
At a high level, TonY consists of three main components: the client, the application master, and the task executors. The client submits the application to the Hadoop cluster; the application master negotiates resources with the cluster and controls the lifecycle of the application; and the task executors run the tasks. TonY enables distributed TensorFlow by populating the cluster spec between the TaskExecutors and the TonyAM. Once the TaskExecutor service starts inside a container, each TaskExecutor registers its IP address together with a reserved port number with the TonyAM. After all TaskExecutors have reported their host information, the TonyAM broadcasts the assembled cluster spec to every TaskExecutor, and each TaskExecutor sets the cluster spec, job type, and job index as local environment variables and starts the task command.
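The registration-and-broadcast handshake described above can be sketched in plain Python as follows. The class and method names (`TonyAM.register`, `TaskExecutor.receive_spec`, the environment variable names) are hypothetical stand-ins for illustration, not TonY’s actual API.

```python
import json

class TonyAM:
    """Hypothetical application master: collects host info, then broadcasts."""

    def __init__(self, num_ps, num_workers):
        self.expected = {"ps": num_ps, "worker": num_workers}
        self.cluster = {"ps": [], "worker": []}

    def register(self, job_type, address):
        # Each TaskExecutor calls in with its reserved IP:port.
        self.cluster[job_type].append(address)

    def ready(self):
        return all(len(self.cluster[t]) == n for t, n in self.expected.items())

    def broadcast(self):
        # Once every executor has registered, everyone gets the full spec.
        assert self.ready(), "not all TaskExecutors have registered"
        return self.cluster


class TaskExecutor:
    """Hypothetical task executor running inside a YARN container."""

    def __init__(self, job_type, index, address):
        self.job_type, self.index, self.address = job_type, index, address
        self.env = {}

    def start(self, am):
        am.register(self.job_type, self.address)

    def receive_spec(self, cluster):
        # Set cluster spec, job type, and job index as local environment
        # variables; the executor would then launch the user's task command.
        self.env["CLUSTER_SPEC"] = json.dumps(cluster)
        self.env["JOB_NAME"] = self.job_type
        self.env["TASK_INDEX"] = str(self.index)


am = TonyAM(num_ps=1, num_workers=2)
executors = [
    TaskExecutor("ps", 0, "10.0.0.1:2222"),
    TaskExecutor("worker", 0, "10.0.0.2:2222"),
    TaskExecutor("worker", 1, "10.0.0.3:2222"),
]
for ex in executors:
    ex.start(am)
spec = am.broadcast()
for ex in executors:
    ex.receive_spec(spec)
```

The key design point this models is the barrier: no task starts until every executor has registered, because the cluster spec is only complete, and hence only useful, once all hosts and ports are known.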
Compared to existing solutions in the community, implementing distributed TensorFlow in this native fashion offers a few advantages. Since data storage and processing pipelines already reside in Hadoop at many large tech firms, adding TensorFlow to the mix allows companies to consolidate computing resources and leverage existing tooling and knowledge. Building TensorFlow infrastructure directly on top of YARN enables flexible, fine-grained control over application behavior and offers GPU support and TensorFlow job tracking.
Jonathan, Keqiu, and Anthony walk you through TonY’s features.
They conclude by exploring TonY’s roadmap, including better HDFS support for file formats like Avro and ORC, as well as a job history server that reports status and diagnostic information for finished jobs, similar to MapReduce’s and Spark’s job history servers, and that would organize and launch TensorBoard for all finished jobs.
Jonathan Hung is a senior software engineer on the Hadoop development team at LinkedIn.
Keqiu Hu is a staff software engineer at LinkedIn, where he is currently working on LinkedIn’s big data platforms, primarily focusing on TensorFlow and Hadoop.
Anthony Hsu is a staff software engineer on the Hadoop development team at LinkedIn, where he works on distributed TensorFlow infrastructure. Previously, he worked on Dali, LinkedIn’s dataset access layer, and Azkaban, LinkedIn’s workflow scheduler. He has also contributed to Apache Hive and Pig.
©2018, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.