Model training is often a manual process that relies on notebooks or command-line scripts run on a shared server or even a laptop. This is convenient for building intuition but eventually fails to scale: notebooks and command-line scripts generally aren’t reproducible, which can lead to confusion about what was running in production and when. Similarly, as a machine learning application comes to depend on a growing number of models (e.g., a common pattern is developing user-specific models as well as a generic model) or on increasingly large datasets, simple tasks like keeping track of training runs and managing computational resources quickly become untenable to do by hand.
To help solve these problems, Stripe built Railyard, an easy-to-use API for training machine learning models that allows fast, reliable iteration on model training. The Railyard workflow provides an API contract for users: Railyard fetches your features and labels, splits the data into training and test sets, passes along any extra JSON you included in the API request, and handles serialization and evaluation of your fitted estimator to complete the job. Railyard is a Scala service that exposes JSON endpoints for training models and fetching the results of model training runs. The service kicks off the training jobs and performs job-state management to track what’s being trained and when the jobs start and finish.
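The workflow contract described above can be sketched roughly as follows. This is a minimal illustration, not Railyard's actual API: the request fields (`data_source`, `test_fraction`, `custom_params`), the `run_training_job` function, and the `MajorityClassWorkflow` class are all hypothetical names chosen for the example, and the "model" is a trivial majority-class predictor standing in for a real estimator.

```python
import json
import random
from typing import Any, Callable, Dict, List, Tuple

Row = Tuple[List[float], int]  # (feature vector, label)


class MajorityClassWorkflow:
    """Toy stand-in for user code: always predicts the most common training label."""

    def train(self, features: List[List[float]], labels: List[int],
              custom_params: Dict[str, Any]) -> "MajorityClassWorkflow":
        # custom_params is the extra JSON from the request; unused in this toy model.
        self.label = max(sorted(set(labels)), key=labels.count)
        return self

    def predict(self, features: List[List[float]]) -> List[int]:
        return [self.label for _ in features]

    def to_json(self) -> str:
        return json.dumps({"label": self.label})


def split_rows(rows: List[Row], test_fraction: float,
               seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle deterministically, then hold out a fraction of rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def run_training_job(request_json: str,
                     fetch_rows: Callable[[str], List[Row]],
                     workflow: MajorityClassWorkflow) -> Dict[str, Any]:
    """Service side of the contract: fetch, split, train, evaluate, serialize."""
    request = json.loads(request_json)
    rows = fetch_rows(request["data_source"])  # hypothetical data lookup
    train_rows, test_rows = split_rows(rows, request.get("test_fraction", 0.25))
    model = workflow.train(
        [x for x, _ in train_rows],
        [y for _, y in train_rows],
        request.get("custom_params", {}),  # extra JSON passed through untouched
    )
    predictions = model.predict([x for x, _ in test_rows])
    correct = sum(p == y for p, (_, y) in zip(predictions, test_rows))
    return {
        "serialized_model": model.to_json(),
        "metrics": {"accuracy": correct / len(test_rows)},
        "n_train": len(train_rows),
        "n_test": len(test_rows),
    }
```

The point of the split in responsibilities is that the user supplies only the `train` and `predict` logic, while data fetching, splitting, evaluation, and serialization stay uniform across every job the service runs.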
Railyard uses Kubernetes as an execution engine for all of the model training runs; Kubernetes performs resource allocation and management. This allows Stripe to flexibly support different resource types, for instance for training runs that require more memory or GPUs. The combination of a flexible API and this execution engine facilitates continuous retraining of thousands of models every week, allowing the company to quickly evolve machine learning models, especially for adversarial machine learning applications like fraud, where models degrade more quickly. As part of continuous retraining, Stripe is able to evaluate not only individual models but also more sophisticated compositions of models.
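As an illustration of how per-run resource needs might be expressed to Kubernetes, the sketch below builds a standard `batch/v1` Job manifest as a Python dictionary. The manifest fields are ordinary Kubernetes Job fields, but the `build_training_job` helper, the job naming scheme, the container arguments, and the image name are assumptions made for this example and do not describe Railyard's internals. The `nvidia.com/gpu` resource name assumes a cluster with the NVIDIA device plugin installed.

```python
from typing import Any, Dict


def build_training_job(job_id: str, image: str, memory: str = "4Gi",
                       cpu: str = "1", gpus: int = 0) -> Dict[str, Any]:
    """Return a Kubernetes Job manifest for one training run (hypothetical helper)."""
    limits: Dict[str, str] = {"memory": memory, "cpu": cpu}
    if gpus > 0:
        # GPUs are requested as an extended resource; requests must equal limits.
        limits["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"train-{job_id}"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "args": ["--job-id", job_id],  # hypothetical CLI flag
                        "resources": {"requests": dict(limits), "limits": limits},
                    }],
                }
            },
        },
    }


# A small CPU-only job and a larger GPU job differ only in their arguments.
small_job = build_training_job("lr-001", "example/trainer:latest")
gpu_job = build_training_job("dnn-002", "example/trainer:latest",
                             memory="32Gi", cpu="8", gpus=1)
```

Because the resource request is just data attached to each job, a logistic regression run and a deep learning run can share the same submission path while landing on very different hardware.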
Kelley Rivoire shares lessons learned from building and evolving the Railyard API to support heterogeneous production model training workflows, with models ranging from logistic regression to deep learning, and from scaling model training using Kubernetes.
Kelley Rivoire is an engineering manager at Stripe, where she leads the data infrastructure group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.
©2019, O'Reilly Media, Inc.