Put AI to Work
April 15-18, 2019
New York, NY

Building a production-scale ML platform

Yu Dong (Facebook)
1:50pm-2:30pm Wednesday, April 17, 2019
Implementing AI
Location: Trianon Ballroom
Secondary topics:  Media, Marketing, Advertising, Platforms and infrastructure
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • ML engineers, data scientists, product managers, and research scientists



Prerequisite knowledge

  • A basic understanding of machine learning

What you'll learn

  • Understand the main motivations for and challenges in building a production-scale ML platform, representative use cases of production-scale ML platforms in products and services, and the key ML workflows, components, and approaches for achieving production-scale performance


Yu Dong offers an overview of the why, what, and how of building a production-scale ML platform, based on ongoing ML research trends and industry adoption.

The motivation:

  • Democratized AI: In August 2018, Gartner said that democratized AI will be one of the major trends that will shape our future technologies. AI technologies will be “virtually everywhere” over the next 10 years but will be open to the masses rather than being purely commercial. Cloud computing, open source projects, and the “maker” community will mold this trend, eventually “propelling AI into everyone’s hands.” AI-based platform-as-a-service (PaaS) solutions, autonomous driving, mobile robots, and conversational AI platforms and assistants are expected to become major enterprise technologies in the future.
  • “One size doesn’t fit all”: We want our platforms to act differently based on the information they’re given. This trend will drive demand for production-scale ML platforms that can digest massive amounts of raw data from a variety of sources and generate or enable personalized models, services, and products at scale.

The challenges:

  • Scalability: The scale factor spans the whole ML lifecycle, from larger datasets to more complex features and models to increasing prediction requests, which poses scalability challenges to both the ML platform and the underlying infrastructure, from compute to storage to network.
  • Stability: Stability is critical to any software platform; you won’t have high expectations of an unstable platform that routinely fails your requests. For ML, ensuring a successful end-to-end (E2E) workflow is an increasingly hard challenge, driven by more complicated model exploration, larger volumes of unverified data to process, and the adoption of cheaper commodity hardware.
  • Cost-awareness: Everyone wants to train a perfect ML model that can serve all requests optimally, but no one can afford unlimited training cost; every company has its budget. Cost-aware ML is becoming a determining factor in any ML platform’s cost efficiency and economies of scale.
  • Usability: Not every ML platform user is an ML expert. According to a recent survey, roughly 75% of future ML developers may simply use pretrained models directly, or do some light tuning before deploying them in their projects. Meanwhile, ML researchers and engineers will use the platform for a wide range of experimentation, from complicated feature engineering to innovative model architecture search. Building a highly usable ML platform that serves these different needs is a nontrivial challenge.



Yu Dong is a senior technical product manager at Facebook, where he works on the company’s AI/ML platform, FBLearner, which enables more personalized and smarter products. Previously, he was a senior software engineering manager at HPE and Cisco. He holds a PhD in computer engineering and an MBA from the University of California, Berkeley. His passion is democratizing AI across industries by building performant, reliable, efficient, resilient, and easy-to-use AI platforms.