The Kubernetes and Hadoop ecosystems are collections of loosely integrated, interrelated tools intended for use by data scientists and data engineers. The advantage conferred by Kubernetes has been the ability to deploy prebuilt offerings from container registries, so tools can be downloaded (pulled) and deployed on systems without the installation pain of compiling from source that was frequently present in Hadoop ecosystem projects.
This approach is sufficient for simple deployments of single containers running isolated processes. But in most cases, users want to scale workflows up and down, using multiple containers to run parallel processes. Doing this requires templatized offerings and an easy way to deploy them. The most common ways to manage this in Kubernetes are Helm charts, Operators, and ksonnet packages, which describe a deployment as a template (typically a collection of YAML or, in ksonnet's case, Jsonnet files) so that it’s reproducible and can be used to generate interconnected pods of containers on demand.
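As a rough illustration of what a templatized deployment means, the sketch below renders a Kubernetes Deployment manifest from a few parameters, so the same template can be scaled up or down by changing one value. It is a minimal sketch in plain Python, not how Helm, Operators, or ksonnet are implemented; the function name `render_deployment`, the application name, and the image URL are hypothetical.

```python
import json


def render_deployment(name, image, replicas):
    """Return a Kubernetes Deployment manifest as a plain dict.

    Hypothetical helper for illustration only: real tools such as Helm
    or ksonnet use their own template languages and packaging.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # The only value that changes when scaling up or down.
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Same template, two sizes. The image URL is a placeholder.
small = render_deployment("trainer", "registry.example.com/trainer:1.0", 2)
large = render_deployment("trainer", "registry.example.com/trainer:1.0", 8)

# JSON is valid YAML, so this output could be saved and passed to kubectl.
print(json.dumps(small, indent=2))
```

Because every other field is identical between `small` and `large`, the deployment stays reproducible while its size varies.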
Kubeflow makes all of this functionality a bit more user-friendly by providing some of the commonly used machine learning projects as prebuilt templatized offerings (ksonnets) that are pretested to integrate together in one Kubernetes namespace. The initial list is based on a common TensorFlow deployment pattern and has been opened up to support other engines and modules. This is revolutionary for managing the complexity around parallelizing compute engines and scaling workflows up and down. But for it to work as intended, it needs a complementary storage layer that can serve data and models to the compute workflows and even save state.
Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow. You’ll learn the benefits of using Kubeflow and the considerations for mounting persistent storage to a Kubeflow tenant in order to provide a unified security model and secure data access that doesn’t require any data movement.
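To make the idea of mounting persistent storage concrete, here is a minimal sketch, again in plain Python, that builds a PersistentVolumeClaim manifest and attaches it to a pod spec so every container reads the same data in place. It is an assumption-laden illustration rather than the approach presented in the session: the helper names, the namespace `kubeflow-tenant`, the claim name, the size, and the mount path are all hypothetical.

```python
import copy


def render_claim(name, namespace, size):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # ReadWriteMany lets several pods share the volume,
            # provided the underlying storage supports it.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
        },
    }


def mount_claim(pod_spec, claim_name, mount_path, volume_name="shared-data"):
    """Return a copy of pod_spec with the claim mounted in every container."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
    )
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
    return spec


# Hypothetical names and sizes, for illustration only.
claim = render_claim("training-data", "kubeflow-tenant", "100Gi")
pod_spec = {"containers": [{"name": "trainer", "image": "registry.example.com/trainer:1.0"}]}
mounted = mount_claim(pod_spec, claim["metadata"]["name"], "/data")
```

The compute containers reference the claim by name only, which is what allows the data to stay where it is while pods are created and destroyed around it.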
Skyler Thomas is an engineer at MapR, where he is designing Kubernetes-based infrastructure to deliver machine learning and big data applications at scale. Previously, Skyler was chief architect for WebSphere user experience at IBM, where he worked with more than a hundred customers to deliver extreme-scale applications in the healthcare, financial services, and retail industries.
Terry He is a senior director of engineering at MapR, where he manages MapR’s Hadoop and ecosystem engineering teams and leads the company’s AI/ML initiatives.
©2019, O'Reilly Media, Inc.