Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Half correct and Half wrong tribal data knowledge: Our 3 patterns to sanity!

11:15–11:55 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 8/9
Secondary topics:  Data preparation, data governance, and data lineage, Financial Services

Who is this presentation for?

Data Architects, Data Engineers, MLOps, Architects

Level

Beginner

Prerequisite knowledge

A general understanding of data pipeline architectures and popular technologies.

What you'll learn

This talk covers lessons learnt in effectively managing a data dictionary, so that analysts and data scientists can be significantly more productive instead of relying on incorrect and outdated tribal knowledge.

Description

In contrast to traditional schema-on-write data warehouses, Data Lakes are schema-on-read: the onus of making sense of the data has moved to analysts and data scientists. Ideally there should be an accurate data dictionary that covers details of the attributes within the Data Lake, such as their business definition, the lineage used to derive them, and their owner. Instead, teams today have to rely on tribal data dictionaries, which are a mixed bag with respect to correctness: knowledge of popular datasets tends to be accurate, but knowledge of less frequently used datasets is spotty and inaccurate. Tribal information also does not keep up with changes. A half-correct tribal dictionary significantly impacts the productivity of analysts and data scientists: reports and models often use data attributes incorrectly, leading to incorrect deductions and wasted time.

While data dictionary tools are available, they require teams to add attribute details manually, and the resulting entries vary in level of detail, correctness, and update frequency. Is there a best of both worlds, where teams neither rely solely on tribal knowledge nor spend significant time updating a data dictionary? In this talk, we describe the 3 patterns we adopted to bring sanity to data dictionaries:

  • Auto-populated dictionary using a lineage tool: We developed a lineage tool that tracks column-level lineage and auto-populates data attributes within the dictionary. The lineage tracking is end-to-end, starting from the source tables.
  • Attribute comment extraction during git check-in: As a part of the code review process, data fields need to be commented. These comments are automatically used to populate the dictionary. Any DDL changes are detected and captured with versioned comments in the data dictionary.
  • Data pipeline change tracking: As a part of the change tracking of ETL and analytics data pipelines, any DDL changes are flagged as backlog items for updating the dictionary.
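
The abstract does not describe how these patterns are implemented. As a rough illustration of the second and third patterns only, the sketch below pulls column comments out of a DDL file at check-in time, lists columns that still lack a comment, and compares two versions of the DDL to find columns that would need a dictionary update. Everything in it is an assumption made for illustration: the Hive-style `COMMENT '...'` column syntax, the one-column-per-line layout, and the names `DictionaryEntry`, `extract_entries`, `uncommented_columns`, `changed_columns` and the sample tables are hypothetical and not taken from the talk.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DictionaryEntry:
    """One hypothetical data dictionary row for a single column."""
    table: str
    column: str
    data_type: str
    comment: str  # empty string when the column has no comment


# Assumed layout: one column per line, e.g.
#   amount DECIMAL(12,2) COMMENT 'Invoice total in USD',
_COLUMN_RE = re.compile(
    r"^\s*(?P<name>\w+)\s+(?P<type>\w+(?:\([^)]*\))?)"
    r"(?:\s+COMMENT\s+'(?P<comment>[^']*)')?\s*,?\s*$",
    re.IGNORECASE,
)
_TABLE_RE = re.compile(r"CREATE\s+TABLE\s+(?P<table>[\w.]+)", re.IGNORECASE)


def extract_entries(ddl: str) -> List[DictionaryEntry]:
    """Parse a single CREATE TABLE statement into dictionary entries."""
    table_match = _TABLE_RE.search(ddl)
    if table_match is None:
        return []
    table = table_match.group("table")
    # Column definitions sit between the first "(" after the table name
    # and the last ")" in the statement.
    body = ddl[ddl.index("(", table_match.end()) + 1 : ddl.rindex(")")]
    entries = []
    for line in body.splitlines():
        match = _COLUMN_RE.match(line)
        if match:
            entries.append(
                DictionaryEntry(
                    table=table,
                    column=match.group("name"),
                    data_type=match.group("type"),
                    comment=match.group("comment") or "",
                )
            )
    return entries


def uncommented_columns(entries: List[DictionaryEntry]) -> List[str]:
    """Columns a check-in hook could reject or flag for missing comments."""
    return [e.column for e in entries if not e.comment]


def changed_columns(
    old: List[DictionaryEntry], new: List[DictionaryEntry]
) -> List[str]:
    """Columns that were added, or whose type or comment changed."""
    old_by_name = {e.column: e for e in old}
    return [e.column for e in new if old_by_name.get(e.column) != e]


# Hypothetical sample DDL: two versions of the same table.
SAMPLE_DDL_V1 = """
CREATE TABLE finance.invoices (
    invoice_id BIGINT COMMENT 'Unique invoice identifier',
    amount DECIMAL(12,2) COMMENT 'Invoice total in USD',
    status STRING
)
"""

SAMPLE_DDL_V2 = """
CREATE TABLE finance.invoices (
    invoice_id BIGINT COMMENT 'Unique invoice identifier',
    amount DECIMAL(12,2) COMMENT 'Invoice total in USD',
    status STRING COMMENT 'Lifecycle state: OPEN, PAID or VOID',
    due_date DATE
)
"""

if __name__ == "__main__":
    v1 = extract_entries(SAMPLE_DDL_V1)
    v2 = extract_entries(SAMPLE_DDL_V2)
    print("Missing comments in v2:", uncommented_columns(v2))
    print("Changed between v1 and v2:", changed_columns(v1, v2))
```

In a setup like this, the list from `uncommented_columns` could drive a code review check, while the list from `changed_columns` could be turned into backlog items. A line-based regular expression is a simplification; a production tool would more likely rely on a real SQL parser.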

Sandeep Uttamchandani

Intuit

Sandeep Uttamchandani is the hands-on Chief Data Architect at Intuit. He is currently leading the Cloud transformation of the Big Data Analytics, ML, and Transactional platform used by 3M+ Small Business Users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep held various engineering roles at VMware and IBM, and founded a startup focused on ML for managing Enterprise systems. Sandeep’s experience uniquely combines building Enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s Federal and Fortune 100 customers. Sandeep has received several excellence awards and has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, gives guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a Program Committee Member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. He blogs on LinkedIn and Wrong Data Fabric (his personal blog). Sandeep holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign.
