Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Half-correct and half-wrong collective data wisdom: 3 patterns to sanity

Sandeep U (Intuit)
11:15–11:55 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 8/9
Average rating: 4.67 (3 ratings)

Who is this presentation for?

  • Data architects, data engineers, and those in ML ops

Level

Beginner

Prerequisite knowledge

  • A general understanding about data pipeline architectures and popular technologies

What you'll learn

  • Learn three strategies for effectively managing a data dictionary

Description

In contrast to traditional schema-on-write data warehouses, data lakes are schema on read. In other words, the onus of making sense of the data has moved to analysts and data scientists. Ideally, there should be an accurate data dictionary that covers the details of each attribute within the data lake (its business definition, the lineage used to derive it, its owner, etc.). Instead, teams today have to rely on dictionaries of collective wisdom, which are a mixed bag with regard to correctness: popular datasets tend to be accurate, but understanding of less frequently used datasets is spotty and inaccurate. Collective wisdom also does not always keep up with changes, or is only half-correct, significantly impacting the productivity of analysts and data scientists. Reports and models, for instance, often use data attributes incorrectly, leading to wrong deductions and wasted time.
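The attribute details listed above (business definition, lineage, owner) can be modeled as a small record per column. The following is a minimal sketch, assuming a Python dataclass; the field and example names are illustrative, not from any specific dictionary tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DictionaryEntry:
    """One attribute in the data dictionary (field names are illustrative)."""
    table: str
    column: str
    business_definition: Optional[str] = None   # what the attribute means to the business
    lineage: List[str] = field(default_factory=list)  # upstream columns used to derive it
    owner: Optional[str] = None                 # team accountable for the attribute

# Example entry for a hypothetical payments table.
entry = DictionaryEntry(
    table="payments",
    column="settlement_amount",
    business_definition="Amount paid out to the merchant after fees",
    lineage=["raw.transactions.amount", "raw.fees.fee_amount"],
    owner="payments-data-team",
)
```

Keeping lineage as a list of fully qualified upstream columns is what lets a lineage tool fill these entries in automatically rather than relying on manual curation.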

While tools for maintaining data dictionaries are available, they require teams to manually add attribute details, and they vary in level of detail, correctness, and update frequency. Is there a best of both worlds, where teams aren't solely relying on collective wisdom but also don't have to spend significant time updating a data dictionary?

Sandeep Uttamchandani outlines three patterns to better manage data dictionaries:

  • An autopopulated dictionary using a lineage tool: The tool tracks column-level lineage and autopopulates data attributes within the dictionary. Lineage tracking is end to end, starting from the source tables.
  • Attribute comment extraction during Git check-in: As part of the code review process, data fields must be annotated with comments, which are automatically used to populate the dictionary. Any DDL change triggers an update, captured as versioned comments in the data dictionary.
  • Data pipeline change tracking: As part of change tracking for ETL and analytics data pipelines, any DDL changes are flagged as backlog items to update the dictionary.
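The second pattern, extracting attribute comments at check-in, might be sketched as follows: a hypothetical review-hook helper that parses column-level COMMENT clauses out of a CREATE TABLE statement and flags undocumented fields. The function name, regex, and sample DDL are assumptions for illustration, not a specific Intuit tool.

```python
import re

def extract_column_comments(ddl: str) -> dict:
    """Parse column-level COMMENT clauses from a CREATE TABLE statement.

    Returns {column_name: comment}; columns without a COMMENT map to None,
    so a review hook can block the check-in until every field is documented.
    """
    comments = {}
    # Only look inside the column list between the outermost parentheses.
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    for line in body.splitlines():
        line = line.strip().rstrip(",")
        if not line:
            continue
        # A column definition starts with "<name> <type> ...".
        name_match = re.match(r"(\w+)\s+\w+", line)
        if not name_match:
            continue
        comment_match = re.search(r"COMMENT\s+'([^']*)'", line, re.IGNORECASE)
        comments[name_match.group(1)] = (
            comment_match.group(1) if comment_match else None
        )
    return comments

# Hypothetical DDL under review.
DDL = """CREATE TABLE payments (
    txn_id BIGINT COMMENT 'unique transaction id',
    amount DECIMAL(18,2) COMMENT 'gross amount in USD',
    memo STRING
)"""

extracted = extract_column_comments(DDL)
undocumented = [col for col, doc in extracted.items() if doc is None]
# A pre-merge gate could reject the change while `undocumented` is non-empty,
# and push the documented comments into the dictionary on merge.
```

In this sketch, the review gate enforces documentation, and the same parsed comments feed the dictionary, which is how the check-in pattern keeps the dictionary current without a separate manual step.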

Sandeep U

Intuit

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he's leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep's experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM's federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.