October 28–31, 2019

HARP: An efficient and elastic GPU-sharing system

Pengfei Fan (Alibaba), Lingling Jin (Alibaba)
2:30pm–3:10pm Wednesday, October 30, 2019
Location: Grand Ballroom A/B

Who is this presentation for?

  • Engineers who want to improve the use of GPUs in R&D environments


New to TensorFlow


Many TensorFlow users buy GPUs to accelerate workloads, yet GPU utilization in AI clusters is generally very low. In R&D environments, for example, users who have requested GPU instances spend much of their time writing code without running any GPU workloads on those servers, which wastes expensive GPU resources.

Pengfei Fan and Lingling Jin offer an overview of an efficient and elastic GPU-sharing system that solves this problem. It detects GPU API calls, allocates GPU resources when they are needed, and automatically reclaims them when no workloads are running. Combined with Kubernetes, this scheme lets users code and edit on CPU pods while debugging and execution run on a remote GPU instance. This elastic system drastically improves GPU cluster utilization.
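The allocate-on-demand, reclaim-when-idle behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the HARP implementation: `GpuPool`, `ElasticGpuClient`, `call`, `reclaim_if_idle`, and the idle timeout are all hypothetical names introduced here, and the "GPU API call" is just a tuple rather than a real forwarded CUDA call.

```python
import time


class GpuPool:
    """Hypothetical pool of remote GPUs, tracked by id."""

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)

    def acquire(self):
        if not self.free:
            raise RuntimeError("no free GPU in the pool")
        return self.free.pop()

    def release(self, gpu_id):
        self.free.append(gpu_id)


class ElasticGpuClient:
    """Holds a GPU only while GPU API calls are actually being made."""

    def __init__(self, pool, idle_timeout_s, clock=time.monotonic):
        self.pool = pool
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.gpu_id = None
        self.last_call = None

    def call(self, api_name, *args):
        # The first GPU API call after an idle period triggers allocation.
        if self.gpu_id is None:
            self.gpu_id = self.pool.acquire()
        self.last_call = self.clock()
        # Stand-in for forwarding the call to the remote GPU server.
        return (self.gpu_id, api_name, args)

    def reclaim_if_idle(self):
        # Assumed to be invoked periodically by some scheduler.
        idle = self.gpu_id is not None and (
            self.clock() - self.last_call >= self.idle_timeout_s
        )
        if idle:
            self.pool.release(self.gpu_id)
            self.gpu_id = None
        return idle
```

In this sketch a client that only edits code never touches the pool; the GPU is taken on the first `call` and returned once `reclaim_if_idle` observes that the timeout has elapsed.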

Because the system forwards GPU API calls to a remote GPU server, it introduces extra latency in application execution. To mitigate this, Pengfei and Lingling made several optimizations so that TensorFlow runs more efficiently on the system. Since TensorFlow already issues memcpy operations and GPU kernel launches asynchronously, they slightly changed the behavior of important CUDA APIs in their virtualization layer, while keeping the APIs functionally correct, so that the local CPU and the remote GPU run asynchronously. This approach hides substantial network latency and yielded a speedup of more than 2x. They also modified the TensorFlow framework to use additional CUDA streams in remote execution, which produced a larger performance gain on their system than in local-running mode, which also uses multiple CUDA streams. Changing the graph-partitioning algorithm between CPU and GPU nodes to minimize data movement between the CPU and the remote GPU server also helped in some cases. And because the system uses remote storage, the GPU accesses remote SSDs directly, which avoids copying data to the CPU nodes.
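The idea of letting the local CPU run ahead of the remote GPU can be sketched as a queue drained by a background thread: asynchronous calls are enqueued and return immediately, and the caller blocks only at an explicit synchronization point. This is a minimal illustration, not the authors' virtualization layer. `RemoteGpuStub`, `AsyncForwarder`, `launch`, and `synchronize` are hypothetical names, the remote server is an in-process stub, and real CUDA semantics (error reporting, stream ordering, memory copies) are more involved.

```python
import queue
import threading


class RemoteGpuStub:
    """Stand-in for the remote GPU server; records calls in arrival order."""

    def __init__(self):
        self.executed = []

    def execute(self, name, args):
        self.executed.append((name, args))


class AsyncForwarder:
    """Forwards calls in order without making the caller wait on each one."""

    def __init__(self, remote):
        self.remote = remote
        self.pending = queue.Queue()
        self.blocking_waits = 0  # how often the caller actually had to wait
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        # A single worker keeps the calls in FIFO order, like one CUDA stream.
        while True:
            name, args = self.pending.get()
            self.remote.execute(name, args)
            self.pending.task_done()

    def launch(self, name, *args):
        # Asynchronous call: enqueue and return immediately.
        self.pending.put((name, args))

    def synchronize(self):
        # The only point where the caller pays for the round trip.
        self.blocking_waits += 1
        self.pending.join()


remote = RemoteGpuStub()
forwarder = AsyncForwarder(remote)
for step in range(100):
    forwarder.launch("kernel", step)
forwarder.synchronize()
```

With a synchronous forwarder the caller would wait once per call (100 times here); in this sketch it waits once, at `synchronize`, which is how the network latency of individual calls is hidden behind useful local work.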

Building such an elastic GPU platform also demands a modified set of GPU monitoring and debugging software. Their system includes a profiling component that collects profiling data from both local and remote servers and visualizes it in a web client. They modified the TensorFlow framework to insert tags using the NVIDIA Tools Extension (NVTX) library, so the modified framework runs both on a normal GPU machine and on their system. These tags provide useful information, such as the start and end of critical operators, and can be visualized in the web client together with the other profiling data.
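As a rough illustration of range tagging, the sketch below marks the start and end of an operation with an NVTX range. It is an assumption-laden stand-in: the talk describes tags inserted inside the TensorFlow framework itself, whereas this example uses NVIDIA's `nvtx` Python package from user code, and falls back to a no-op when that package is not installed. The helper `tagged`, the function `run_step`, and the range name are hypothetical.

```python
import contextlib

try:
    import nvtx  # optional: NVIDIA's Python NVTX bindings
except ImportError:
    nvtx = None


def tagged(name):
    """Return a context manager that marks a named range for the profiler."""
    if nvtx is None:
        # No NVTX available: the code still runs, just without tags.
        return contextlib.nullcontext()
    return nvtx.annotate(message=name)


def run_step(values):
    # The range opens when the block is entered and closes when it exits,
    # so a profiler can show where this operation starts and ends.
    with tagged("critical_op"):
        return sum(v * v for v in values)
```

Because the tagging degrades to a no-op, the same code path runs with or without a profiler attached, which mirrors the stated goal of one framework build that works on both a normal GPU machine and the shared system.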

As AI accelerators’ computation power grows rapidly and network speed improves, Pengfei and Lingling believe that pooling these accelerators together and providing services over networks is the future trend. They’re in the process of deploying their software in their R&D environment and plan to open source part or all of the solution so that their framework can work with any AI accelerator, not just GPUs.

Prerequisite knowledge

  • A working knowledge of GPU and TensorFlow programming

What you'll learn

  • How to set up a flexible GPU-sharing system, along with performance optimization tricks for TensorFlow on that system

Pengfei Fan


Pengfei Fan is a senior heterogeneous computing engineer at Alibaba Cloud. Previously, he worked on GPU compute architecture at NVIDIA. Pengfei is focused on designing and implementing virtualization and scheduling systems for heterogeneous infrastructure to accelerate AI applications and improve hardware use.


Lingling Jin


Lingling Jin is a senior manager at Alibaba, where she focuses on heterogeneous infrastructures to accelerate AI applications and improve hardware use. Previously, she was part of NVIDIA’s Compute Architecture Group. She earned her PhD at the University of California, Riverside.

