In contrast to traditional schema-on-write data warehouses, data lakes are schema-on-read. In other words, the onus of making sense of the data has moved to analysts and data scientists. Ideally, there should be an accurate data dictionary that covers details of attributes within the data lake (their business definition, the lineage used to derive them, their owner, etc.). Instead, teams today have to rely on dictionaries of collective wisdom, which are a mixed bag with regard to correctness: understanding of popular datasets tends to be accurate, but understanding of less frequently used datasets is spotty and inaccurate. Collective wisdom also does not always keep up with changes, or is only half correct, which significantly hurts the productivity of analysts and data scientists. Reports and models often use data attributes incorrectly, for instance, leading to incorrect deductions and wasted time.
While tools for maintaining data dictionaries are available, they require teams to add attribute details manually, and the resulting dictionaries vary in level of detail, correctness, and update frequency. Is there a best of both worlds, where teams do not rely solely on collective wisdom but also don’t have to spend significant time updating a data dictionary?
Sandeep Uttamchandani outlines three patterns to better manage data dictionaries.
Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he’s leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.
©2019, O’Reilly UK Ltd