Always accurate business metrics through lineage-based anomaly tracking
Who is this presentation for?
- Data engineers, data architects, and developers
Debugging real-world data pipelines is highly nontrivial: finding the root cause of incorrect metrics across terabytes to petabytes of data can take hours or even days. The first step in debugging is understanding what changed, and there are many moving parts, including ETL code changes, source database configuration changes, application schema or data logic changes, dependencies on derived tables produced by other pipelines, data quality changes, raw data anomalies, metric anomalies, and job scheduling errors.
Today, finding these changes is a matter of trial and error, and it requires highly trained data engineers to work together with database engineers, data analysts, and data scientists. As the industry democratizes the data platform, analysts and data scientists need an automated framework that tracks and flags every change that directly or indirectly impacts their pipelines.
Shradha Ambekar and Sunil Goplani outline how Intuit implemented this capability and developed SuperGLUE, Intuit’s open source data lineage framework, which automatically discovers data pipeline lineage. For each entity in the lineage, it tracks configuration, execution, and data profiling stats related to transactional sources, tables, jobs, and business events. By combining this information and applying anomaly detection, SuperGLUE helps debug data pipeline issues in minutes. It also lets users subscribe to pipelines and raises alerts as soon as any execution changes, configuration changes, or anomalies impacting those pipelines are detected. By proactively detecting issues, SuperGLUE helps analyze downstream impact across pipelines at lightning speed.
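The idea of combining lineage with per-entity stats can be illustrated with a minimal sketch. This is not SuperGLUE's implementation, which the abstract does not describe: the lineage map, the row counts, the entity names, and both functions are hypothetical, and a simple z-score check stands in for whatever anomaly detection Intuit actually applies.

```python
from collections import deque
from statistics import mean, stdev

# Hypothetical lineage: each entity maps to the entities it reads from.
UPSTREAM = {
    "revenue_metric": ["orders_agg"],
    "orders_agg": ["orders_raw", "customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}

# Hypothetical daily row counts per entity; the last value is the latest run.
ROW_COUNTS = {
    "revenue_metric": [100, 101, 99, 100, 60],
    "orders_agg": [1000, 1010, 990, 1005, 600],
    "orders_raw": [5000, 5050, 4950, 5020, 3000],
    "customers_raw": [200, 202, 198, 201, 200],
}


def is_anomalous(history, threshold=3.0):
    """Flag the latest value if its z-score against earlier values exceeds threshold."""
    *baseline, latest = history
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def upstream_anomalies(entity, upstream, stats):
    """Walk the lineage breadth-first from entity and return (name, hops)
    for every entity, including the start, whose latest stat is anomalous."""
    seen = {entity}
    queue = deque([(entity, 0)])
    flagged = []
    while queue:
        node, hops = queue.popleft()
        if node in stats and is_anomalous(stats[node]):
            flagged.append((node, hops))
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, hops + 1))
    return flagged


if __name__ == "__main__":
    for name, hops in upstream_anomalies("revenue_metric", UPSTREAM, ROW_COUNTS):
        print(f"{name}: anomalous, {hops} hop(s) upstream")
```

In this toy data the drop in the metric traces back through the aggregate table to the raw orders source, while the customers source is cleared. The entity flagged farthest upstream is a natural first candidate for the root cause, and walking the same graph in the opposite direction would give the downstream impact.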
Through lineage-based anomaly tracking, SuperGLUE helped Intuit achieve robust, reliable, and accurate business metrics. It also improved developer productivity and reduced debugging time by 10x–100x.
You’ll learn the details of the design, implementation, and self-serve capabilities across different personas.
Prerequisite knowledge
- A basic understanding of big data platforms and data pipelines
What you'll learn
- Techniques to proactively monitor your data pipelines and assess impact with the help of lineage
- Anomaly detection techniques that Intuit applied to the stats collected for each entity in the lineage
- Concepts and architecture patterns required to build data pipeline lineage at scale
Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions to building solutions around the data platform and has contributed to spark-cassandra-connector. She has experience with Hadoop distributed file system (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.
Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master’s degree in computer science.