Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Semi-automated analytic pipeline creation and validation using active learning

Sean Ma (Trifacta)

2:40pm–3:20pm Wednesday, March 7, 2018

Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience
Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Who is this presentation for?

Data analysts, data engineers, and data scientists

Prerequisite knowledge

Familiarity with analytic, reporting, visualization, or business intelligence tools or concepts

What you'll learn

Learn methods for applying statistical and machine learning techniques for automatically detecting potential anomalies in data pipelines as well as predicting potential transformations to resolve such anomalies
Explore interaction and visualization techniques for validating and training models to identify and resolve issues in data pipelines

Description

Organizations leverage reporting, analytics, and machine learning pipelines to drive decision making and power critical operational systems. These pipelines increasingly rely on dynamic internal data sources or third-party data that often do not conform to a desired target data model. Additionally, the structure and encodings of these input data sources may change frequently and without warning. Any discrepancy between a source and its expected target can break pipelines or lead to errors in downstream results that may be hard to detect and fix. Most software for constructing and monitoring such pipelines requires users to manually craft both the data validation rules for detecting such issues and the data transformations required to resolve these inconsistencies. This process can be tedious, time-consuming, and error-prone.
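To make the idea of a hand-written validation rule concrete, here is a minimal sketch of checking incoming records against a target data model. It is an illustration only, not the approach presented in the session: the schema, the column names, and the function find_discrepancies are all hypothetical.

```python
# Hypothetical target data model: column name -> expected Python type.
TARGET_SCHEMA = {"order_id": int, "amount": float, "country": str}


def find_discrepancies(rows, schema):
    """Return a list of (row_index, column, problem) tuples."""
    issues = []
    expected = set(schema)
    for i, row in enumerate(rows):
        present = set(row)
        # Columns the target model requires but the source did not send.
        for col in sorted(expected - present):
            issues.append((i, col, "missing column"))
        # Columns the source sent that the target model does not know.
        for col in sorted(present - expected):
            issues.append((i, col, "unexpected column"))
        # Columns present on both sides: check the value's type.
        for col in sorted(expected & present):
            value = row[col]
            if not isinstance(value, schema[col]):
                problem = (
                    f"expected {schema[col].__name__}, "
                    f"got {type(value).__name__}"
                )
                issues.append((i, col, problem))
    return issues


# Made-up sample records, for illustration only.
rows = [
    {"order_id": 1, "amount": 19.99, "country": "US"},
    {"order_id": "2", "amount": 5.0, "country": "GB"},  # id arrives as a string
    {"order_id": 3, "amt": 7.5, "country": "DE"},       # column renamed upstream
]

issues = find_discrepancies(rows, TARGET_SCHEMA)
for issue in issues:
    print(issue)
```

Each rule here had to be written by hand and only catches the cases its author anticipated, which is the manual effort the paragraph above describes.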

Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models. Sean demonstrates how to leverage technical metadata and historical data to predict corrective transformations, visualization techniques that scale to thousands of discrepancies, and how interaction techniques can enable data analysts to train and validate predictive models, producing more rapid error detection and resolution. He concludes with examples from the retail, insurance, and pharmaceutical industries.
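The abstract does not say which active learning strategy is used. As a rough illustration of the general idea, the sketch below applies uncertainty sampling, a common query strategy, to decide which predicted transformations an analyst should review first. The prediction tuples, confidence values, and the function select_for_review are assumptions made up for this example.

```python
def select_for_review(predictions, k):
    """Pick the k predictions whose confidence is closest to 0.5.

    predictions: list of (discrepancy_id, suggested_fix, confidence), where
    confidence is a model's estimated probability that the fix is correct.
    Values near 0.5 are the ones the model is least sure about.
    """
    return sorted(predictions, key=lambda p: abs(p[2] - 0.5))[:k]


# Hypothetical model output for four discrepancies.
predictions = [
    ("d1", "cast order_id to int", 0.97),
    ("d2", "rename amt to amount", 0.55),
    ("d3", "parse date as DD/MM/YYYY", 0.48),
    ("d4", "drop column legacy_flag", 0.10),
]

to_review = select_for_review(predictions, k=2)
print([p[0] for p in to_review])
```

In a full loop, the analyst's accept or reject decisions on the selected items would be added to the training data and the model refit, so that review effort goes to the cases where a label is most informative.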

Sean Ma

Trifacta

Sean Ma is the Director of Product Management at Trifacta. With more than 10 years of experience in enterprise data management software, Sean has spent the last 5 years building big data products at companies such as Informatica and Trifacta. He holds a Bachelor of Science degree in Electrical Engineering and Computer Science from the University of California, Berkeley.


©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com