Who is this presentation for?
Data engineers, data architects, and developers
When considering implementations of model serving, the actual inference is only part of the problem. While models get the most press coverage, data remains the main bottleneck in most ML projects. As a result, data quality, governance, and lineage are common requirements for model serving implementations. Meeting them requires an understanding of the complete data flow involved in model serving, specifically: What data flow do you need to capture to confidently manage the model's deployment and usage? What metadata should you capture for this data flow, who needs this information, and how will it be used? And which metadata management tool lets you store and track both the metadata itself and its interrelationships?
ML and model serving are quickly becoming an integral part of enterprise data processing and, as such, require data governance solutions that help answer questions such as:
- Which sources were used to create the model?
- Which parameter configurations were used for model creation?
- Where are the exported models stored and is the storage secure?
- What type of model is it (e.g., TensorFlow, logistic regression)?
- What are the model quality parameters?
- What are the data definitions for model inputs and outputs? How are they related to the overall application flow?
- Which taxonomies should be used for models and data?
- What are the data schemas of these sources?
- Which processes are using the model and how is data transformed?
- Can you classify the data as private, public, etc.? What are the overall data authentication and authorization concerns?
To answer some of these questions, Kubeflow recently introduced ML Metadata, a library for recording and retrieving metadata associated with ML developer and data scientist workflows. Although ML Metadata addresses some of the metadata management requirements for model creation, it doesn't cover model serving concerns.
Boris Lublinsky and Dean Wampler outline this missing piece: metadata management for model serving. They explore what information you need about running systems and why it's important, and they define the metadata (and its format) that needs to be captured for model serving, including model metadata; metadata for the model's inputs and outputs, i.e., data schemas for the model's input and output, including definitions of the parameters; and model serving deployment artifacts, such as runtimes.
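As a minimal sketch of what such serving metadata might look like, here is a set of hypothetical Python dataclasses; the class and field names are illustrative only, not a standard schema, and the values are made up:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FieldSchema:
    """Definition of a single model input or output field (illustrative)."""
    name: str
    dtype: str          # e.g. "float32", "string"
    description: str = ""

@dataclass
class ServedModelMetadata:
    """Metadata captured for a deployed model (hypothetical schema)."""
    name: str
    model_type: str               # e.g. "TensorFlow", "logistic regression"
    version: str
    source_datasets: List[str]    # lineage: datasets used to create the model
    storage_uri: str              # where the exported model is stored
    inputs: List[FieldSchema]     # input data schema
    outputs: List[FieldSchema]    # output data schema
    serving_runtime: str          # deployment artifact, e.g. the serving runtime

# Example record (all values are made up for illustration)
model = ServedModelMetadata(
    name="churn-predictor",
    model_type="TensorFlow",
    version="1.0.0",
    source_datasets=["s3://data/customers-2019"],
    storage_uri="s3://models/churn/1",
    inputs=[FieldSchema("tenure_months", "int64")],
    outputs=[FieldSchema("churn_probability", "float32")],
    serving_runtime="TensorFlow Serving",
)
```

A record like this is enough to answer several of the governance questions above: the model's type, where it is stored, which sources fed it, and what its input and output schemas are.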
Once the metadata is fully defined, you can use any metadata management tool to work with it. Boris and Dean use Apache Atlas as an example of how a metadata framework can be leveraged to manage model serving metadata, understand it, and take appropriate actions as required. You’ll learn how Atlas REST APIs can create and populate metadata and how to use Atlas UI for viewing this information. And you’ll discover approaches to integrate enterprise applications with Atlas.
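As a sketch of the REST-based approach, a model entity can be registered by POSTing a JSON payload to Atlas's v2 entity endpoint. The type name `served_model` and its attributes below are hypothetical (a matching type definition would have to be created in Atlas first), and the host and credentials are placeholders:

```python
import json

# Hypothetical Atlas entity payload for a served model; the type
# "served_model" and its attributes are illustrative, not built-in Atlas types.
entity = {
    "entity": {
        "typeName": "served_model",
        "attributes": {
            "qualifiedName": "churn-predictor@serving",
            "name": "churn-predictor",
            "modelType": "TensorFlow",
            "storageUri": "s3://models/churn/1",
        },
    }
}

payload = json.dumps(entity)

# Submitting it would look roughly like this (requires a running Atlas
# instance; host, port, and credentials are placeholders):
# import requests
# requests.post("http://atlas-host:21000/api/atlas/v2/entity",
#               data=payload,
#               headers={"Content-Type": "application/json"},
#               auth=("admin", "admin"))
```

Once registered, the entity becomes searchable and viewable in the Atlas UI, and relationships to other entities (datasets, processes) can be added for lineage tracking.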
Prerequisite knowledge
- A working knowledge of model serving
- Experience with data processing
- Familiarity with programming, primarily Java and Scala
Materials or downloads needed in advance
- A laptop with Java, Scala, SBT, a Scala development environment (IntelliJ), and Docker installed
- Download the starter project
What you'll learn
- Understand the role of metadata, and approaches to managing it, in model serving implementations
- Learn about Apache Atlas and its capabilities for gathering, processing, and maintaining metadata
- Get hands-on experience creating and managing metadata
Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.
Dean Wampler (http://twitter.com/deanwampler) is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He is Head of Developer Relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he is the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.