Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Persistent storage for machine learning in KubeFlow

Skyler Thomas (MapR), Terry He (MapR Technologies)
5:10pm-5:50pm Wednesday, March 27, 2019

Who is this presentation for?

  • CDOs, data scientists, data engineers, and machine learning engineers



Prerequisite knowledge

  • Familiarity with container and ML technology (useful but not required)

What you'll learn

  • How to use persistent storage to support parallelized ML frameworks with differing compute footprints


The Kubernetes and Hadoop ecosystems are conglomerates of occasionally integrated and interrelated tools intended for use by data scientists and data engineers. The advantage conferred by Kubernetes has been the ability to deploy prebuilt offerings from container registries, allowing tools to be easily downloaded (pulled) and deployed on systems, without the traditional install pain around compiling from source that was frequently present in Hadoop ecosystem projects.

This approach is sufficient for simple deployments of single containers running isolated processes. But in most cases, users want to scale workflows up and down, using multiple containers to run parallel processes. Doing so requires templatized offerings and an easy way to deploy them. The most common ways to manage this in Kubernetes are Helm charts, Operators, and ksonnets. Each describes a deployment as a reproducible template, typically a collection of YAML or Jsonnet files, that can be used to generate interconnected pods of containers on demand.
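As a rough illustration of what "templatized" means here, the sketch below renders a Kubernetes Deployment manifest from a few parameters, so the same template can be scaled up or down by changing only the replica count. It is not taken from the talk; the function name render_deployment, the image name, and the "kubeflow" namespace default are assumptions made for the example, and the output is a plain dictionary rather than a call to any cluster API.

```python
import json


def render_deployment(name, image, replicas, namespace="kubeflow"):
    """Return an apps/v1 Deployment manifest as a plain dict.

    Hypothetical helper: it only builds the manifest, it does not
    talk to a cluster.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": dict(labels)},
        "spec": {
            # The only value that changes when scaling up or down.
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Same template, two different compute footprints (image name is made up).
small = render_deployment("trainer", "example.com/ml/trainer:1.0", replicas=2)
large = render_deployment("trainer", "example.com/ml/trainer:1.0", replicas=8)

print(json.dumps(small, indent=2))
```

Tools such as Helm and ksonnet do this kind of parameter substitution with their own template languages; the Python version only shows the idea that the pod template stays fixed while the parameters vary.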

KubeFlow makes all of this functionality a bit more user-friendly by providing some of the commonly used machine learning projects as prebuilt templatized offerings (ksonnets) that are pretested to integrate together in one Kubernetes namespace. The initial list is based on a common TensorFlow deployment pattern and has been opened up to support other engines and modules. This greatly reduces the complexity of parallelizing compute engines and scaling workflows up and down. But for it to work as intended, it needs a complementary storage layer that can serve data and models to the compute workflows and save state.

Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow. You’ll learn the benefits of using KubeFlow and the considerations for mounting persistent storage to a KubeFlow tenant in order to provide a unified security model and secure data access that doesn’t require any data movement.
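A minimal sketch of what mounting persistent storage can look like in plain Kubernetes terms: a PersistentVolumeClaim that requests shared storage, and a pod that mounts the claim at a path. This is a generic Kubernetes pattern, not the speakers' specific setup. The helper names, the storage class "example-distributed-fs", the namespace, the image, and the sizes are all hypothetical, and ReadWriteMany is assumed because several parallel workers would read and write the same data.

```python
import json


def shared_volume_claim(claim_name, namespace, size="100Gi",
                        storage_class="example-distributed-fs"):
    """Build a PersistentVolumeClaim manifest (storage class name is made up)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            # ReadWriteMany lets many pods mount the same volume read-write.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_with_shared_data(pod_name, namespace, image, claim_name,
                         mount_path="/data"):
    """Build a Pod manifest that mounts the claim at mount_path."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name, "namespace": namespace},
        "spec": {
            "containers": [{
                "name": pod_name,
                "image": image,
                "volumeMounts": [{"name": "shared-data", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "shared-data",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


claim = shared_volume_claim("training-data", "kubeflow")
workers = [
    pod_with_shared_data(f"worker-{i}", "kubeflow",
                         "example.com/ml/trainer:1.0", "training-data")
    for i in range(3)
]

print(json.dumps(claim, indent=2))
print(json.dumps(workers[0], indent=2))
```

Because every worker references the same claim, the data and saved models stay in one place while the number of pods changes, which is the "no data movement" property the abstract describes.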

Skyler Thomas


Skyler Thomas is an engineer at MapR, where he is designing Kubernetes-based infrastructure to deliver machine learning and big data applications at scale. Previously, Skyler was chief architect for WebSphere user experience at IBM, where he worked with more than a hundred customers to deliver extreme-scaled applications in the healthcare, financial services, and retail industries.

Terry He

MapR Technologies

Terry He is a senior director of engineering at MapR, where he manages MapR’s Hadoop and ecosystem engineering teams and leads the company’s AI/ML initiatives.