Understanding data governance for machine learning models
Who is this presentation for?

Data engineers, data architects, and developers
Once you start deploying ML models to production, several facts and implications become apparent.
Dean Wampler and Boris Lublinsky argue that models are data: they have all the characteristics of data, including, in many cases, how they're "packaged," as collections of model parameters and hyperparameters rather than as source code. Controlled access is required to prevent tampering, theft of intellectual property, and, in some cases, leakage of sensitive information encapsulated in the models. Therefore, model access requires authentication, authorization, and auditing.
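As a sketch of what the third requirement, auditing, might look like alongside authorization: the function name, repository shape, and log format below are illustrative assumptions, not details from the talk.

```python
import datetime

# Every access attempt, granted or denied, is recorded for later audit.
AUDIT_LOG = []

def access_model(repo: dict, model_id: str, user: str, allowed_users: set):
    """Authorize a model access request and record an audit entry either way."""
    granted = user in allowed_users
    AUDIT_LOG.append({
        "model_id": model_id,
        "user": user,
        "granted": granted,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    if not granted:
        raise PermissionError(f"{user} may not access {model_id}")
    return repo[model_id]
```

In a real deployment this check would sit behind whatever authentication layer establishes `user`; the point is that denied attempts are as important to log as granted ones.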
Reproducibility is crucial to ensure that you deliver to production exactly the model you think you're delivering, avoiding human error and as much uncertainty as possible. This requires that the full lifecycle be reproducible and automated, beginning with hyperparameter selection and model training and continuing through delivery and eventual retirement. Here the mature DevOps practices of continuous integration and continuous delivery (CI/CD) can be adapted to model delivery, but a unique challenge is incorporating these development-centric techniques into a data science environment, where different tools, techniques, and approaches are common. Because model training can be expensive, and is not completely deterministic even with the same training data, a repository of all trained models is useful.
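One common way to key such a repository is a content fingerprint over the trained artifact and everything that produced it. The following is a minimal sketch under assumed names (`model_fingerprint`, the entry layout); it is not an API from the talk.

```python
import hashlib
import json

def model_fingerprint(model_bytes: bytes, hyperparams: dict, dataset_id: str) -> str:
    """Return a reproducible identifier for a trained model: a hash over the
    serialized parameters, the hyperparameters, and the training dataset id."""
    h = hashlib.sha256()
    h.update(model_bytes)
    # Sort keys so the same hyperparameters always hash identically.
    h.update(json.dumps(hyperparams, sort_keys=True).encode())
    h.update(dataset_id.encode())
    return h.hexdigest()

# A hypothetical repository entry keyed by the fingerprint.
entry = {
    "model_id": model_fingerprint(b"serialized-params", {"lr": 0.01, "depth": 6}, "train-2019-05"),
    "trained_at": "2019-05-01T12:00:00Z",
    "metrics": {"auc": 0.91},
}
```

Because training isn't fully deterministic, two runs with identical inputs can still produce different parameter bytes, and hence different fingerprints; that's exactly why each trained instance deserves its own repository entry.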
Usually you retire a model because of concept drift: the model has become less effective because the data it scores today has changed (drifted) relative to the data used to train it. This is another way in which models are data; they're replaced more frequently than code typically is. And where code is judged against a specification, models are judged against metrics.
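A crude way to detect such drift is to compare a feature's live distribution against its training distribution. This sketch uses a simple mean-shift test with an assumed threshold; production systems typically use richer statistics (e.g., population stability index or KS tests).

```python
from statistics import mean, stdev

def drifted(train_sample, live_sample, threshold=3.0):
    """Flag drift when the live feature mean falls more than `threshold`
    training standard deviations away from the training mean."""
    mu, sigma = mean(train_sample), stdev(train_sample)
    return abs(mean(live_sample) - mu) > threshold * sigma
```

When the flag fires, the governance question becomes which model to retrain, against which new dataset, which is why the metadata discussed next matters.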
Once deployed, model decisions must be explainable, especially when analyzing failure scenarios, determining when requirements for fairness aren't met, or understanding what factors affected the training process. This means metadata is required for each model, such as:

- When was it trained, deployed, and retired?
- Which people and systems accessed the model, in what way, and when?
- What kind of model is it?
- What are the model parameters and hyperparameters?
- What techniques were used to train the model?
- What dataset was used in training, and what is its schema?
- What quality and performance metrics did the model achieve when trained?
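The metadata above could be captured in a record like the following. The field names and types are one possible encoding, assumed for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ModelMetadata:
    model_id: str
    model_kind: str                      # e.g., "gradient-boosted trees"
    trained_at: str                      # ISO-8601 timestamps
    deployed_at: Optional[str] = None
    retired_at: Optional[str] = None
    hyperparameters: Dict[str, float] = field(default_factory=dict)
    training_technique: str = ""         # e.g., "5-fold CV grid search"
    training_dataset: str = ""           # dataset name or URI
    dataset_schema: Dict[str, str] = field(default_factory=dict)
    training_metrics: Dict[str, float] = field(default_factory=dict)
    access_log: List[str] = field(default_factory=list)  # who accessed it, when
```

Keeping this record alongside the model artifact, rather than scattered across logs, is what makes failure analysis and fairness audits tractable later.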
Some of the unique characteristics of production model serving include the following: unlike software components, which are expected to satisfy static specifications (at least for the life of a particular component), models are judged by their performance against metrics, such as scoring confidence, timing, and resource overhead. Developers aren't accustomed to the inherently statistical nature of model serving, so data engineering needs to accommodate these differences. Models are also replaced periodically, either to account for concept drift or to try improved models; the infrastructure needs to support this replacement strategy, as well as patterns like canary deployments and speculative execution. And because a new model is (hopefully) improving the quality of the results, it inherently changes the behavior of the system.
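The canary pattern mentioned above can be sketched as a router that sends a small fraction of traffic to the candidate model and tags each result with the model that produced it. The model representation (an `id` plus a scoring function) and the `route` name are assumptions for this sketch.

```python
import random

def route(record, current_model, candidate_model,
          canary_fraction=0.05, rng=random.random):
    """Canary routing: score most records with the current model, a small
    fraction with the candidate, and tag each result with the id of the
    model that produced it so results can be compared per model."""
    model = candidate_model if rng() < canary_fraction else current_model
    return {"score": model["fn"](record), "model_id": model["id"]}
```

Injecting `rng` makes the routing decision testable; the `model_id` tag is what lets you later compare the candidate's metrics against the incumbent's before promoting it.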
These characteristics have implications for data governance. For example, what actual model instance was used to score a particular data record? If a score is added to a record, it’s usually also necessary to add a model identifier and possibly other metadata, like quality metrics.
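A minimal sketch of that enrichment step, with assumed field names (`score`, `model_id`, `score_confidence`):

```python
def score_record(record: dict, model_id: str, score: float, confidence: float) -> dict:
    """Return a copy of the record enriched with the score and the provenance
    needed for governance: which model produced it and how confident it was."""
    return {**record,
            "score": score,
            "model_id": model_id,
            "score_confidence": confidence}
```

Returning a copy rather than mutating the input keeps the original record intact for downstream consumers that don't expect the extra fields.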
From the above requirements and implications, you’ll explore emerging practices, such as adapting existing CI/CD practices for traceability, expanded to integrate well with data science processes; considerations for model updating in running applications; and available tools, services, and best practices.
Prerequisite knowledge

- Experience with ML concepts
What you'll learn
- Learn why data governance is important for ML models in production
- Understand the unique requirements for data governance of ML models
- Identify ways to satisfy those requirements
Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean’s the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.
Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.