10 lead indicators before data becomes a mess
Who is this presentation for?
Data engineers, data architects, developers
“Data is a mess” are common words of data scientists across the world. There’s no silver bullet technology to ensure high-quality data. Instead, quality depends on the checks and bounds applied during the lifecycle of data as it’s generated at the source, stored in a lake, transformed, and analyzed by pipelines. The processes during the data lifecycle are the lead indicators of data quality, in other words, the signals available before data becomes a mess.
The lifecycle of the data is divided into the following four stages:
- Source: data creation within the application tier, that is, transactional databases, clickstream, logs, IoT sensors, etc.
- Ingestion: data collected from the sources in batch or real time and stored in the lake
- Prep: data available in the catalog documenting the attributes of the data as well as metadata properties such as value distributions, enums, etc.
- Metrics logic: transformation of the data into derived attributes and aggregates made available as metrics and features
Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani lead a deep dive into the indicators Intuit developed at each stage of the data lifecycle that contribute to the resulting data quality. For each indicator, they detail the key tools and frameworks that radically improved data quality for Intuit's key business models and dashboards:
- Data definition language (DDL) alerting for source databases (source)
- Monitoring of CDC replication frameworks (source)
- Data availability monitoring for third-party sources (ingestion)
- Data parity validation during ingestion (ingestion)
- Anomaly tracking of data lake table properties (ingestion)
- Data format validation for semistructured data (ingestion)
- Automated data contracts with cross-BU data silos (prep)
- ETL traceability framework (prep)
- Sandbox CI/CD automation (metrics logic)
- Business definition versioning (metrics logic)
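To make two of these indicators concrete, here is a minimal sketch of what "data parity validation during ingestion" and "anomaly tracking of data lake table properties" might look like as row-count checks. This is an illustration only, not Intuit's actual implementation; the function names, tolerance, and z-score threshold are assumptions chosen for the example.

```python
# Hypothetical sketch of two ingestion-stage quality checks:
# row-count parity (source vs. lake) and row-count anomaly tracking.
from statistics import mean, stdev


def parity_ok(source_count: int, lake_count: int, tolerance: float = 0.001) -> bool:
    """Check row-count parity between a source table and its lake copy.

    Passes when the relative difference is within `tolerance`
    (0.1% by default), allowing for small in-flight replication lag.
    """
    if source_count == 0:
        return lake_count == 0
    return abs(source_count - lake_count) / source_count <= tolerance


def is_row_count_anomaly(history: list[int], today: int,
                         z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than `z_threshold`
    standard deviations from the trailing history of daily counts."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold
```

In practice such checks would run inside the ingestion pipeline and feed an alerting system, turning a lag measure (a broken dashboard) into a lead measure (a parity or anomaly alert on the day the data drifted).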
You’ll discover how to take control of your data quality and move it from a passive lag measure to a proactive lead measure.
Prerequisite knowledge
- Experience managing data platforms in production
What you'll learn
- Discover how to proactively track data quality during its lifecycle stages and the tools you can use for checks and bounds
Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he’s leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.
Giriraj Bagdi is a DevOps leader of cloud and data at Intuit, where he leads infrastructure engineering and SRE teams in delivering technology and functional capabilities for online platforms. He has driven and managed large, complex initiatives in cloud data infrastructure, automation engineering, big data, and database transactional platforms. Giriraj has extensive knowledge of building engineering solutions and platforms to improve the operational efficiency of cloud infrastructure in the areas of command and control and data reliability for big data, high-transaction, high-volume, and high-availability environments. He drives initiatives transforming big data engineering and migrating to AWS big data technologies such as EMR, Athena, QuickSight, etc. He’s an innovative, energetic, and goal-oriented technologist and a team player with strong problem-solving skills.
Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master’s degree in computer science.