Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Scaling model training: From flexible training APIs to resource management with Kubernetes

Kelley Rivoire (Stripe)

4:20pm–5:00pm Wednesday, March 27, 2019

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Automation in data science and big data, Financial Services, Model lifecycle management

Average rating:

(4.33, 3 ratings)

Who is this presentation for?

Data scientists, machine learning engineers, and data infrastructure engineers

Level

Intermediate

Prerequisite knowledge

A basic understanding of machine learning and production engineering

What you'll learn

Learn how Stripe scaled machine learning with its Railyard API
Understand the challenges of reproducibility, reliability, and scale/automation
Explore considerations for machine learning APIs, including how to design interfaces to support heterogeneous workloads as ML models and libraries evolve

Description

Model training is often a manual process using notebooks or command-line scripts run on a shared server or even a laptop. This is convenient for building intuition but at some point fails to scale: notebooks and command-line scripts generally aren’t reproducible, which can lead to confusion about what was running in production and when. Similarly, as a machine learning application comes to rely on a growing number of models (e.g., a common pattern is developing user-specific models as well as a generic model) or increasingly large datasets, simple tasks like keeping track of training runs and managing computational resources quickly become untenable to do by hand.

To help solve these problems, Stripe built Railyard, an easy-to-use API for training machine learning models that allows fast, reliable iteration on model training. The Railyard workflow provides an API contract for users: Railyard fetches your features and labels, splits the data into training and test sets, passes along any extra JSON you sent to the API, and handles serialization and evaluation of your fitted estimator to complete the job. Railyard itself is a Scala service that exposes JSON endpoints for training models and fetching the results of model training runs. The service kicks off the training jobs and performs job-state management to track what’s being trained and when each job starts and finishes.
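
The division of labor in that contract can be illustrated with a minimal Python sketch. It is not Railyard's actual interface: the names `Workflow`, `run_job`, `fetch_toy_data`, and the request fields `data_source`, `seed`, `test_fraction`, and `custom_params` are all hypothetical, and a toy majority-class estimator stands in for a real model so the example needs only the standard library. The point is only that the user writes `train`, while the service side handles fetching, splitting, evaluation, and serialization.

```python
import json
import random
from abc import ABC, abstractmethod


class Workflow(ABC):
    """Hypothetical contract: the user supplies only train()."""

    @abstractmethod
    def train(self, features, labels, custom_params):
        """Return a fitted estimator exposing predict(rows)."""


class MajorityClass:
    """Toy estimator that always predicts the most common training label."""

    def fit(self, features, labels):
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


class MajorityWorkflow(Workflow):
    def train(self, features, labels, custom_params):
        # custom_params is the extra JSON passed through untouched by the service.
        return MajorityClass().fit(features, labels)


def run_job(request_json, workflow, fetch_data):
    """Service side of the sketch: fetch, split, train, evaluate, serialize."""
    request = json.loads(request_json)
    features, labels = fetch_data(request["data_source"])

    # Seeded shuffle so rerunning the same request reproduces the same split.
    indices = list(range(len(labels)))
    random.Random(request.get("seed", 0)).shuffle(indices)
    n_test = int(len(indices) * request.get("test_fraction", 0.25))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    estimator = workflow.train(
        [features[i] for i in train_idx],
        [labels[i] for i in train_idx],
        request.get("custom_params", {}),
    )

    predictions = estimator.predict([features[i] for i in test_idx])
    correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return {
        "accuracy": correct / len(test_idx),
        "n_train": len(train_idx),
        "n_test": len(test_idx),
        # Toy serialization: the estimator's fitted attributes as JSON.
        "model_json": json.dumps(vars(estimator)),
    }


def fetch_toy_data(source):
    # Stand-in for a real feature/label fetch; `source` is ignored here.
    features = [[float(i)] for i in range(8)]
    labels = [1, 1, 1, 1, 1, 1, 1, 0]
    return features, labels


request = json.dumps({"data_source": "toy", "seed": 7, "custom_params": {}})
result = run_job(request, MajorityWorkflow(), fetch_toy_data)
```

Because the split is seeded from the request, submitting the same JSON twice yields the same training and test sets, which is one simple way to get the reproducibility that ad hoc notebooks lack.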

Railyard uses Kubernetes as an execution engine for all of the model training runs; Kubernetes performs resource allocation and management. This allows Stripe to flexibly support different resource types for training runs requiring more memory or GPUs, for instance. The combination of flexible API and execution engine facilitates continuous retraining of thousands of models every week, allowing the company to quickly evolve machine learning models, especially for adversarial machine learning applications like fraud, where models degrade more quickly. As part of continuous retraining, Stripe is able to evaluate not only individual models but also more sophisticated compositions of models.
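
As a rough illustration of how per-run resource needs might be expressed to Kubernetes, the sketch below builds a `batch/v1` Job manifest as a plain Python dictionary. The function name `training_job_manifest`, the container name, the image, and the `--job-id` argument are hypothetical and not taken from Railyard; the manifest fields themselves (`resources.requests`, `resources.limits`, `restartPolicy`, `backoffLimit`) are standard Kubernetes, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin.

```python
import json


def training_job_manifest(job_id, image, memory="4Gi", cpu="2", gpus=0):
    """Build a batch/v1 Job manifest for a single training run."""
    limits = {"memory": memory, "cpu": cpu}
    if gpus:
        # GPUs are an extended resource and are requested under limits.
        limits["nvidia.com/gpu"] = gpus
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"train-{job_id}",
            "labels": {"app": "model-training"},
        },
        "spec": {
            # Do not retry failed pods; surface the failure to the caller.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            # Hypothetical flag telling the container which run to execute.
                            "args": ["--job-id", job_id],
                            "resources": {
                                "requests": {"memory": memory, "cpu": cpu},
                                "limits": limits,
                            },
                        }
                    ],
                }
            },
        },
    }


# A small CPU-only run and a larger GPU run differ only in their arguments.
cpu_job = training_job_manifest("abc123", "example.com/trainer:latest")
gpu_job = training_job_manifest(
    "def456", "example.com/trainer:latest", memory="16Gi", cpu="4", gpus=1
)
manifest_text = json.dumps(gpu_job, indent=2)
```

Kubernetes accepts manifests as JSON as well as YAML, so a service could submit `manifest_text` to the API server directly and leave scheduling onto a suitable node to the cluster.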

Kelley Rivoire shares lessons learned from building and evolving the Railyard API to support heterogeneous production training workflows, for models ranging from logistic regression to deep learning, and from scaling model training with Kubernetes.

Kelley Rivoire

Stripe

Kelley Rivoire is the head of data infrastructure at Stripe, where she leads the Data Infrastructure Group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com