Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Semi-automated analytic pipeline creation and validation using active learning

Sean Ma (Trifacta)
2:40pm–3:20pm Wednesday, March 7, 2018
Secondary topics:  Data Integration and Data Pipelines

Who is this presentation for?

  • Data analysts, data engineers, and data scientists

Prerequisite knowledge

  • Familiarity with analytic, reporting, visualization, or business intelligence tools or concepts

What you'll learn

  • Learn how to apply statistical and machine learning techniques to automatically detect potential anomalies in data pipelines and to predict transformations that resolve them
  • Explore interaction and visualization techniques for validating and training models to identify and resolve issues in data pipelines


Organizations leverage reporting, analytics, and machine learning pipelines to drive decision making and power critical operational systems. These pipelines increasingly rely on dynamic internal data sources or third-party data that often do not conform to a desired target data model. Additionally, the structure and encodings of these input data sources may change frequently and without warning. Any discrepancy between a source and the expected target can break pipelines or lead to errors in downstream results that may be hard to detect and fix. Most software for constructing and monitoring such pipelines requires users to manually craft both the data validation rules for detecting such issues and the data transformations required to resolve these inconsistencies. This process can be tedious, time-consuming, and error-prone.
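To make the manual approach concrete, here is a minimal sketch of the kind of hand-written validation rule described above. It is an illustration only, not the method presented in the session: the target data model (`TARGET`), the `Column` type, the `find_discrepancies` helper, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type


# Hypothetical target data model; a real pipeline would define its own.
TARGET = [Column("order_id", int), Column("amount", float), Column("country", str)]


def find_discrepancies(rows, target=TARGET):
    """Return (row_index, column, problem) tuples for rows that do not fit the target."""
    issues = []
    expected = {col.name for col in target}
    for i, row in enumerate(rows):
        for col in target:
            if col.name not in row:
                issues.append((i, col.name, "missing"))
                continue
            try:
                # A value conforms if it can be cast to the target type.
                col.dtype(row[col.name])
            except (TypeError, ValueError):
                issues.append((i, col.name, f"not castable to {col.dtype.__name__}"))
        # Columns present in the source but absent from the target model.
        for extra in sorted(set(row) - expected):
            issues.append((i, extra, "unexpected column"))
    return issues


# Made-up source rows: one clean, one with an encoding change, one with a renamed column.
rows = [
    {"order_id": "1001", "amount": "19.99", "country": "US"},
    {"order_id": "1002", "amount": "$5.00", "country": "DE"},
    {"order_id": "1003", "amt": "7.50", "country": "FR"},
]

issues = find_discrepancies(rows)
for issue in issues:
    print(issue)
```

Every new source quirk (a currency symbol, a renamed column) needs another rule or a matching corrective transformation written by hand, which is the effort the session proposes to reduce.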

Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models. Sean demonstrates how to leverage technical metadata and historical data to predict corrective transformations, presents visualization techniques that scale to thousands of discrepancies, and shows how interaction techniques enable data analysts to train and validate predictive models, leading to faster error detection and resolution. He concludes with examples from the retail, insurance, and pharmaceutical industries.

Sean Ma


Sean Ma is the Director of Product Management at Trifacta. With over 10 years of experience in enterprise data management software, Sean has spent the last 5 years building Big Data products at companies such as Informatica and Trifacta. He holds a Bachelor of Science degree in Electrical Engineering and Computer Science from the University of California, Berkeley.