Mar 15–18, 2020

Data lineage enables reproducible and reliable machine learning at scale

Sihui Hu (Microsoft), Dom Divakaruni (Microsoft)
4:15pm–4:55pm Tuesday, March 17, 2020
Location: LL20C

Who is this presentation for?

Data scientists or analysts




During iterative model development, data scientists need a way to ensure that results are reproducible, so they don’t attribute a gain to one parameter change when a hidden change elsewhere is the real source of the improvement. You need to capture how the original data source was extracted and transformed on its way to its current state. You also need the ability to version the training and evaluation datasets and reapply the same versions to compare model performance accurately. This becomes especially critical when teams collaborate on building models or use features and embeddings curated by other teams. Once models are deployed to production and machine learning engineers want to understand why model performance degrades over time, they need to examine the full lineage graph to see how the training data was extracted and transformed, and to retrieve the environment setup and model training code so they can debug and decide on a retraining strategy.
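As a rough illustration of versioning and reapplying the same training and evaluation datasets, the sketch below fingerprints each dataset by content and only treats two runs as comparable when both fingerprints match. It is a minimal sketch under stated assumptions, not the approach shown in the session: the names `dataset_fingerprint`, `record_run`, and `comparable` are hypothetical, and datasets are assumed to be small lists of JSON-serializable records.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a deterministic SHA-256 hex digest for a list of records.

    Assumption: each record is a JSON-serializable dict. Keys are sorted so
    that key order does not change the fingerprint; row order still matters.
    """
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def record_run(registry, run_id, train_rows, eval_rows, params):
    """Store the dataset versions and parameters used by one run (hypothetical helper)."""
    registry[run_id] = {
        "train_version": dataset_fingerprint(train_rows),
        "eval_version": dataset_fingerprint(eval_rows),
        "params": dict(params),
    }
    return registry[run_id]


def comparable(registry, run_a, run_b):
    """Two runs are comparable only if they used identical train and eval data."""
    a, b = registry[run_a], registry[run_b]
    return (
        a["train_version"] == b["train_version"]
        and a["eval_version"] == b["eval_version"]
    )
```

With a check like this, a reported gain between two runs can be attributed to a parameter change only when `comparable` returns `True`; otherwise the data itself changed between the runs.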

Sihui “May” Hu and Dominic Divakaruni unpack effective ways to track the full lineage from data preparation to model training to inference, including retrieving data-to-data, data-to-model, and model-to-deployment lineages in one graph. You’ll see a demo that illustrates a full machine learning lineage graph from data preparation to model training to inference and shows how to retrieve the source of a raw dataset and its usage in various machine learning experiments; how a model was trained and how it was used in applications; and the input dataset, environment setup, training code, and output models of a machine learning experiment.
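The idea of holding data-to-data, data-to-model, and model-to-deployment lineage in one graph can be sketched with a small directed graph. This is an illustrative sketch only, not the Azure Machine Learning implementation or the demo from the session: the `LineageGraph` class and every node name below are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph whose edges point from an input artifact to what it produced."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        """Record that `target` was derived from, trained on, or deployed from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every node reachable from `start`.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, node):
        """Everything `node` depends on, directly or transitively."""
        return self._walk(node, self._upstream)

    def downstream(self, node):
        """Everything that was produced, directly or transitively, from `node`."""
        return self._walk(node, self._downstream)


# Hypothetical example: one raw dataset feeding two experiments.
graph = LineageGraph()
graph.add_edge("raw_data", "clean_data")          # data-to-data
graph.add_edge("clean_data", "experiment_1")      # data-to-model (via an experiment)
graph.add_edge("training_code", "experiment_1")
graph.add_edge("environment", "experiment_1")
graph.add_edge("experiment_1", "model_v1")
graph.add_edge("model_v1", "deployment_prod")     # model-to-deployment
graph.add_edge("raw_data", "features_b")
graph.add_edge("features_b", "experiment_2")
```

Calling `graph.upstream("deployment_prod")` answers the debugging question of which data, code, and environment led to a deployed model, while `graph.downstream("raw_data")` lists every experiment, model, and deployment that used the same raw dataset.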

Prerequisite knowledge

  • A basic understanding of machine learning and the cloud

What you'll learn

  • Learn why data lineage is critical for reproducible and reliable machine learning at scale
  • Discover effective ways to track the full lineage from data preparation to model training to inference

Sihui Hu


Sihui “May” Hu (she/her) is a program manager at Microsoft, focused on creating data management and data lineage solutions for the Azure Machine Learning service. Previously, she spent two years working in the ecommerce industry and completed several internships in product management. She graduated from Carnegie Mellon University, where she studied information systems management.

Dom Divakaruni


Dom Divakaruni is a principal group program manager at Microsoft working on the Azure Machine Learning platform. His current areas of focus include applying and managing data for machine learning, including data access, exploratory data analysis, data lineage, and data drift. Dom’s prior work includes building tools to help customers deploy models to production, deep learning frameworks, accelerated computing, and GPUs.
