Container orchestrator to DL workload, Bing's approach: FrameworkLauncher

Kai Liu (BING) (Microsoft), Yuqi Wang (Microsoft), Bin Wang (Microsoft)

4:00pm–4:40pm Thursday, September 12, 2019

Location: 230 B

Implementing AI

Who is this presentation for?

CTOs and directors for AI platforms

Level

Beginner

Description

Microsoft Bing has a large-scale deployment of Hadoop, Spark, Kafka, and other open source technologies with more than 1 million cores and 4 million GB of RAM. It needs to run large, complex workflows and services on top of this stack, but orchestrating containers for workflows and services at this scale is challenging, and no existing solution fully meets its needs. The company created and open-sourced a technology called FrameworkLauncher. It has a proven track record in Microsoft Bing’s large-scale production environment and has been partially open-sourced as the core component of Open Platform for AI.

Kai Liu, Yuqi Wang, and Bin Wang explore the main feature set of FrameworkLauncher. It has high availability: all launcher and Hadoop components are recoverable and work preserving, so user services are designed to remain uninterrupted when components shut down, crash, upgrade, or are out for a long time. It has high usability: no user code changes are needed to run an existing executable inside a container. It also supports the requirements of services and batch jobs, such as GPU scheduling, port scheduling, and gang scheduling, among others. And it has a number of cluster-provider-related features, such as AskMode to extend machine maintenance time for uninterrupted workloads, workload deployment, and launcher watchdog and alerts.
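
For readers unfamiliar with gang scheduling, the following is a minimal sketch of the general idea: either every task in a group gets its resources, or none does. It isn't FrameworkLauncher's API or algorithm. The names `Node`, `TaskRequest`, and `gang_schedule` are hypothetical, and placement uses a simple first-fit heuristic that may reject a gang a smarter packer could place.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical cluster node with a count of unallocated GPUs."""
    name: str
    free_gpus: int


@dataclass
class TaskRequest:
    """Hypothetical request for one task of a distributed job."""
    task_id: str
    gpus: int


def gang_schedule(nodes: List[Node], gang: List[TaskRequest]) -> Optional[Dict[str, str]]:
    """Place every task in the gang, or place none of them.

    Returns a task_id -> node name mapping, or None if the gang does not fit.
    """
    # Plan against a copy of free capacity so a failed attempt changes nothing.
    free = {n.name: n.free_gpus for n in nodes}
    placement: Dict[str, str] = {}

    # Largest requests first (first-fit decreasing heuristic).
    for task in sorted(gang, key=lambda t: t.gpus, reverse=True):
        target = next((name for name, gpus in free.items() if gpus >= task.gpus), None)
        if target is None:
            return None  # one task cannot fit, so the whole gang is rejected
        free[target] -= task.gpus
        placement[task.task_id] = target

    # Commit the allocation only after every task has a slot.
    for n in nodes:
        n.free_gpus = free[n.name]
    return placement


if __name__ == "__main__":
    cluster = [Node("node-a", 4), Node("node-b", 2)]
    job = [TaskRequest("worker-0", 2), TaskRequest("worker-1", 2), TaskRequest("ps-0", 2)]
    print(gang_schedule(cluster, job))
    # The cluster is now full, so a second gang is rejected as a whole.
    print(gang_schedule(cluster, [TaskRequest("worker-2", 1)]))
```

The all-or-nothing commit matters for distributed DL training, where a partially placed job would hold GPUs while waiting for workers that may never start.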

Prerequisite knowledge

Familiarity with Apache YARN

What you'll learn

Learn a general approach to designing a container orchestrator for all kinds of workloads

Kai Liu (BING)

Microsoft

Kai Liu is a senior program manager in the AI and Research Group of Microsoft. He has seven years of experience in data-driven engineering, big data platforms, and AI infrastructure for Office product families. He led his team to create a service health portal for SharePoint Online, inject a distributed log collection and storage system for Exchange Online, publish curated datasets and key business metrics, and enable subhour experimentation in Office 365. He’s working on the AI and deep learning infrastructure for large-scale enterprise data under compliance obligations.

Yuqi Wang

Microsoft

Yuqi Wang is a software engineer in the AI and Research Group of Microsoft. He has three years of experience in Apache YARN, container orchestration, and AI infrastructure. He’s the author and maintainer of Microsoft FrameworkLauncher, which is built to orchestrate all kinds of workloads through the same interface without making changes to the workloads themselves. He has also contributed several features to YARN internally to better support long-running services on Windows. He’s working on FrameworkLauncher to better support AI workloads and to run natively on Kubernetes.

Bin Wang

Microsoft

Bin Wang is a principal software engineering manager in the AI and Research Group of Microsoft, where he’s the tech manager of the multitenancy team and the go-to person across the entire platform team in this area. He’s initiated key efforts to improve the stability of YARN, which is now deployed to 30,000+ machines and supports 30P+ of cold data. He also leads efforts in supporting model training such as ChaNa and LR/MCLR on YARN, which have contributed to ads selection, PA, MM, AdInsight, relevance, etc. He leads the team to support Linux workloads on Windows by extending YARN to support on-demand VM lifecycle provisioning. The MT effort now extends to other key AIR scenarios, such as image processing, DR, Malta data processing, bot trainer, etc. Bin also leads the development of the OSS DL training platform OpenPAI, which is specifically designed to be user friendly and extensible for various DL training frameworks and can run on-premises as well as in cloud environments.