Who is this presentation for?
Data engineers, data architects, and developers
When considering implementations of model serving, the actual inference is only part of the problem. While models get the most press coverage, the reality is that data remains the main bottleneck in most ML projects. As a result, data quality, governance, and lineage have become common requirements for model serving implementations. Meeting them requires an understanding of the complete data flow involved in model serving, specifically:
1. Which data flows in your system do you need to capture to confidently manage the model's deployment and usage?
2. What metadata should you capture for this data flow? Who needs this information and how will it be used?
3. Which metadata management tool lets you store and track the metadata itself and its interrelationships?
Machine learning and model serving are quickly becoming an integral part of enterprise data processing and as such require data governance solutions that help answer questions such as:
1. Which sources have been used to create the model?
2. Which parameter configurations were used for model creation?
3. Where are the exported models stored and is the storage secure?
4. What type of model is it (e.g., TensorFlow, logistic regression)?
5. What are the model quality parameters?
6. What are data definitions for model inputs and outputs? How are they related to the overall applications flow?
7. Which taxonomies should be used for models and data?
8. What are the data schemas of these sources?
9. Which processes are using the model, and how is data transformed by them?
10. Can we classify the data as private, public, etc.? What are the overall data authentication and authorization concerns?
To answer some of these questions, Kubeflow recently introduced ML Metadata, a library for recording and retrieving metadata associated with developer and data scientist workflows in machine learning. Although ML Metadata addresses some of the metadata management requirements for model creation (questions 1-3), it does not cover model serving concerns (questions 4-10).
In this tutorial we will cover this missing piece: metadata management for model serving. We will start by discussing what information about the running system we need to capture and why it is important. We will then define the metadata (and its format) that needs to be captured for model serving, including:
Model metadata – answering questions 4-6.
Metadata for the model inputs and outputs – definition of data schemas for input and output of the model, including definitions of the parameters.
Model serving deployment artifacts, including runtimes, etc.
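As an illustration of the kind of metadata listed above, the following sketch models it as Java records. The record names, fields, and sample values are our own assumptions for this example, not a fixed schema.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: model-serving metadata as Java records.
// All names and fields here are illustrative assumptions.

// One input or output field of the model (question 6).
record FieldDef(String name, String dataType, String description) {}

// Core model metadata (questions 4 and 5).
record ModelMetadata(
    String name,
    String modelType,                   // e.g., "TensorFlow", "logistic regression"
    String version,
    Map<String, Double> qualityMetrics, // e.g., accuracy, AUC
    List<FieldDef> inputs,
    List<FieldDef> outputs) {}

// Deployment artifacts: where the exported model lives and what runs it.
record ServingDeployment(
    ModelMetadata model,
    String artifactLocation,            // e.g., a secured object-store URI
    String runtime) {}                  // e.g., "TensorFlow Serving"

class MetadataExample {
    public static void main(String[] args) {
        ModelMetadata wineModel = new ModelMetadata(
            "wine-quality", "TensorFlow", "1.0",
            Map.of("accuracy", 0.92),
            List.of(new FieldDef("fixedAcidity", "double", "fixed acidity of the wine")),
            List.of(new FieldDef("quality", "double", "predicted quality score")));
        ServingDeployment deployment = new ServingDeployment(
            wineModel, "s3://models/wine-quality/1.0", "TensorFlow Serving");
        System.out.println(deployment.model().name() + " v" + deployment.model().version());
    }
}
```

Structuring the metadata this way keeps the model description, its input/output schemas, and the deployment artifacts linked, which is exactly the set of relationships a metadata management tool needs to track.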
Once the metadata is fully defined, we can use any metadata management tool to work with it. In this tutorial we will use Apache Atlas as an example of how a metadata framework can be leveraged to manage model serving metadata, understand it, and then take appropriate actions as required.
We will show how Atlas REST APIs can be used for creating and populating metadata and how to use Atlas UI for viewing this information.
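To give a flavor of what this looks like, here is a minimal sketch of building an entity payload for the Atlas v2 REST API. The custom type name `ml_model_serving` and its attributes are our own assumptions; such a type would first have to be registered with Atlas, and the endpoint shown assumes a locally running Atlas server on its default port.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of creating a model-serving metadata entity via the Atlas v2 REST API.
// The type "ml_model_serving" and its attributes are hypothetical.
class AtlasEntityExample {
    // Atlas v2 entity payloads wrap the entity in an {"entity": {...}} envelope.
    static String entityPayload(String typeName, String name, String modelType) {
        return """
            {"entity": {
               "typeName": "%s",
               "attributes": {
                 "qualifiedName": "%s@serving",
                 "name": "%s",
                 "modelType": "%s"
               }
            }}""".formatted(typeName, name, name, modelType);
    }

    public static void main(String[] args) {
        String payload = entityPayload("ml_model_serving", "wine-quality", "TensorFlow");
        System.out.println(payload);

        // Against a running Atlas server (default port 21000), the payload would
        // be POSTed to the v2 entity endpoint:
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:21000/api/atlas/v2/entity"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
        // would then create the entity; it is left commented out here because it
        // requires a live Atlas instance.
    }
}
```

Entities created this way become visible in the Atlas UI, where their attributes and lineage relationships can be browsed.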
Finally we will discuss approaches to integrate enterprise applications with Atlas.
Prerequisite knowledge
1. Knowledge and understanding of model serving
2. Experience with data processing
3. Basic programming skills, primarily Java and Scala
Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years' experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He's also cofounder of and frequent speaker at several Chicago user groups.
Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.