Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Scaling model training: From flexible training APIs to resource management with Kubernetes

Kelley Rivoire (Stripe)
4:20pm–5:00pm Wednesday, March 27, 2019
Secondary topics:  Automation in data science and big data, Financial Services, Model lifecycle management
Average rating: 4.33 (3 ratings)

Who is this presentation for?

  • Data scientists, machine learning engineers, and data infrastructure engineers



Prerequisite knowledge

  • A basic understanding of machine learning and production engineering

What you'll learn

  • Learn how Stripe scaled machine learning with its Railyard API
  • Understand the challenges of reproducibility, reliability, and scale/automation
  • Explore considerations for machine learning APIs, including how to design interfaces to support heterogeneous workloads as ML models and libraries evolve


Model training is often a manual process using notebooks or command-line scripts run on a shared server or even a laptop. This is convenient for building intuition but at some point fails to scale: notebooks and command-line scripts generally aren’t reproducible, which can lead to confusion about what was running in production and when. Similarly, as a machine learning application comes to rely on a growing number of models (e.g., a common pattern is developing user-specific models as well as a generic model) or increasingly large datasets, simple tasks like keeping track of training runs and managing computational resources quickly become untenable to do by hand.

To help solve these problems, Stripe built Railyard, an easy-to-use API for training machine learning models, allowing fast, reliable iteration on model training. The Railyard workflow provides an API contract for users: Railyard fetches your features and labels, splits the data into training and test sets, passes along any extra JSON included in the API request, and handles serialization and evaluation of your fitted estimator to complete the job. Railyard is a Scala service that exposes JSON endpoints for training models and fetching the results of model training runs. The service kicks off the training jobs and performs job-state management to track what’s being trained and when the jobs start and finish.
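
The workflow contract described above can be illustrated with a minimal Python sketch. This is not Railyard's actual code or API: the request fields (`features`, `label`, `test_fraction`, `seed`, `custom_params`), the function names, and the use of accuracy as the evaluation metric are all assumptions made for illustration. The sketch only shows the shape of the contract: the service fetches and splits the data, forwards extra JSON to user code, then evaluates and serializes the fitted model.

```python
import json
import pickle
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Tuple[List[float], int]  # (feature vector, label)


def split_rows(rows: Sequence[Row], test_fraction: float, seed: int) -> Tuple[List[Row], List[Row]]:
    """Shuffle deterministically, then split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def run_training_job(
    request_json: str,
    fetch_rows: Callable[[List[str], str], Sequence[Row]],
    train: Callable[[List[Row], Dict[str, Any]], Any],
    predict: Callable[[Any, List[float]], int],
) -> Dict[str, Any]:
    """Hypothetical service-side flow for one training request."""
    request = json.loads(request_json)

    # 1. Fetch the features and labels named in the request.
    rows = fetch_rows(request["features"], request["label"])

    # 2. Split the data into training and test sets.
    train_rows, test_rows = split_rows(
        rows, request.get("test_fraction", 0.25), request.get("seed", 0)
    )

    # 3. Call user code, passing along any extra JSON untouched.
    model = train(train_rows, request.get("custom_params", {}))

    # 4. Evaluate the fitted model on the held-out rows (accuracy is an assumed metric).
    correct = sum(1 for x, y in test_rows if predict(model, x) == y)
    accuracy = correct / len(test_rows) if test_rows else float("nan")

    # 5. Serialize the fitted model so it can be stored with the results.
    return {
        "model_bytes": pickle.dumps(model),
        "accuracy": accuracy,
        "n_train": len(train_rows),
        "n_test": len(test_rows),
    }
```

In this sketch the user supplies only `train` and `predict`; everything else is handled on the service side, which is the division of responsibility the contract implies.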

Railyard uses Kubernetes as an execution engine for all of the model training runs; Kubernetes performs resource allocation and management. This allows Stripe to flexibly support different resource types for training runs requiring more memory or GPUs, for instance. The combination of flexible API and execution engine facilitates continuous retraining of thousands of models every week, allowing the company to quickly evolve machine learning models, especially for adversarial machine learning applications like fraud, where models degrade more quickly. As part of continuous retraining, Stripe is able to evaluate not only individual models but also more sophisticated compositions of models.
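
As a rough illustration of how per-job resource requirements might be expressed, the sketch below builds a Kubernetes `batch/v1` Job manifest as a plain Python dictionary. The manifest fields follow the standard Kubernetes Job schema, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin. The function name, default values, job name, and container image are hypothetical, and this is not how Railyard itself constructs its jobs.

```python
import json
from typing import Any, Dict


def build_training_job(
    job_name: str,
    image: str,
    memory: str = "4Gi",
    cpu: str = "2",
    gpus: int = 0,
) -> Dict[str, Any]:
    """Return a Kubernetes Job manifest for a single training run."""
    resources: Dict[str, Dict[str, Any]] = {
        "requests": {"memory": memory, "cpu": cpu},
        "limits": {"memory": memory, "cpu": cpu},
    }
    if gpus > 0:
        # GPUs are requested through the limits section.
        resources["limits"]["nvidia.com/gpu"] = gpus

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,  # assumption: do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "resources": resources,
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = build_training_job(
        "fraud-model-retrain", "example.com/trainer:latest", memory="32Gi", gpus=1
    )
    print(json.dumps(manifest, indent=2))
```

Because the resource block is just data attached to each job, a memory-heavy run and a GPU run can share the same submission path and differ only in the values passed in.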

Kelley Rivoire shares lessons learned from building and evolving the Railyard API to support heterogeneous production training workflows, for models ranging from logistic regression to deep learning, and from scaling model training using Kubernetes.

Kelley Rivoire


Kelley Rivoire is an engineering manager at Stripe, where she leads the data infrastructure group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.