Organizations leverage reporting, analytics, and machine learning pipelines to drive decision making and power critical operational systems. These pipelines increasingly rely on dynamic internal data sources or third-party data that often do not conform to the desired target data model. Additionally, the structure and encodings of these input data sources may change frequently and without warning. Any discrepancy between a source and its expected target can break pipelines or introduce errors in downstream results that are hard to detect and fix. Most software for constructing and monitoring such pipelines requires users to manually craft both the data validation rules that detect such issues and the data transformations that resolve them. This process can be tedious, time-consuming, and error-prone.
Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models. Sean demonstrates how to leverage technical metadata and historical data to predict corrective transformations, presents visualization techniques that scale to thousands of discrepancies, and shows how interaction techniques can enable data analysts to train and validate predictive models, leading to faster error detection and resolution. He concludes with examples from the retail, insurance, and pharmaceutical industries.
Sean Ma is the Director of Product Management at Trifacta. With more than 10 years of experience in enterprise data management software, Sean has spent the last 5 years building Big Data products at companies such as Informatica and Trifacta. He holds a Bachelor of Science degree in Electrical Engineering and Computer Science from the University of California, Berkeley.
©2018, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.