Time travel for data pipelines: Solving the mystery of what changed
Who is this presentation for?
- Data engineers and architects
Debugging real-world data pipelines is highly nontrivial: finding the root cause of incorrect insights can take hours or even days. The first step in the debugging process is to understand what changed, and there are many moving parts: ETL code changes, ingestion configuration changes, source database configuration changes, application schema or data logic changes, dependencies on derived tables produced by other pipelines, data lake or warehouse configuration changes, data quality changes, job scheduling errors, data fabric failures, cross-database data integrity issues, and so on.
Today, the approach to finding these changes is trial and error, and it requires highly trained data engineers to work together with database engineers, data analysts, and scientists. As we democratize the data platform, we need an automated framework that analysts and data scientists can use to automatically track and flag every change that directly or indirectly impacts their pipeline.
Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani explore SuperGLUE, Intuit's homegrown data lineage framework. SuperGLUE automatically discovers data pipeline lineage. For each entity in the lineage, it tracks configuration and execution stats related to transactional sources, the big data fabric, data cardinality, jobs, and business events. By combining this information and applying anomaly detection, SuperGLUE helps debug data pipeline issues in minutes. They dive into the details of the design and implementation, as well as Intuit's phased execution approach, and cover the adoption of this self-serve capability across different personas. Moving forward, Intuit is working on proactive problem alerting, with the ability to raise alerts as soon as configuration or execution changes that impact data pipelines are detected.
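SuperGLUE's internals are not public, so as a rough illustration of the core idea, here is a minimal sketch, with assumed entity names and data structures, of how a discovered lineage graph can be walked upstream from a broken report to surface every change that directly or indirectly feeds it:

```python
from collections import deque

# Hypothetical sketch: the lineage graph, change records, and entity names
# below are illustrative assumptions, not SuperGLUE's actual API.

# Upstream lineage: each entity maps to the entities it is derived from.
LINEAGE = {
    "report.revenue_daily": ["warehouse.orders", "warehouse.customers"],
    "warehouse.orders": ["source.orders_db"],
    "warehouse.customers": ["source.crm_db"],
}

# Per-entity change flags, e.g. produced by diffing today's configuration
# and execution stats against yesterday's snapshot.
CHANGED = {"source.orders_db": "ingestion config modified"}

def upstream_changes(entity, lineage, changed):
    """Walk the lineage upstream (BFS) and collect every changed entity
    that directly or indirectly feeds the given entity."""
    seen, queue, hits = set(), deque([entity]), []
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
                if parent in changed:
                    hits.append((parent, changed[parent]))
    return hits

print(upstream_changes("report.revenue_daily", LINEAGE, CHANGED))
# -> [('source.orders_db', 'ingestion config modified')]
```

The payoff of combining lineage with change tracking is exactly this query: instead of manually bisecting the pipeline, an analyst asks "what changed upstream of my report?" and gets an answer in one traversal.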
Prerequisite knowledge
- A basic understanding of big data platforms and data pipelines
What you'll learn
- Concepts and architecture patterns required to build data pipeline lineage at scale
- Techniques for proactively monitoring your data pipelines with the help of lineage
- Anomaly-detection techniques Intuit has applied to the stats collected for each entity in the lineage
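The talk does not specify which anomaly-detection method is applied to the per-entity stats, so the sketch below stands in with a simple z-score test over historical row counts; the threshold and the stats being monitored are assumptions:

```python
import statistics

# Illustrative only: a basic z-score check flagging a run whose stat
# (e.g. row count) deviates sharply from its own history.

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it deviates from the historical mean by more
    than `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

daily_rows = [1_002_113, 998_560, 1_005_870, 1_001_440, 999_205]
print(is_anomalous(daily_rows, 1_000_900))  # typical run -> False
print(is_anomalous(daily_rows, 150_000))    # sudden drop -> True
```

Running a check like this over every entity in the lineage is what turns passive lineage metadata into the proactive alerting described above: a sudden drop in cardinality at any upstream entity can be flagged before it corrupts downstream reports.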
Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she's the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions to solutions built around the data platform and has contributed to the spark-cassandra-connector. She has experience with HDFS, Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O'Reilly Open Source Conference in 2019. She holds a bachelor's degree in electronics and communication engineering from NIT Raipur, India.
Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master's degree in computer science.
Sandeep Uttamchandani is a chief data architect at Intuit, where he leads the cloud transformation of the big data analytics, ML, and transactional platform used by 4M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep was cofounder and CEO of a machine learning startup focused on ML for managing enterprise systems, and he played various engineering roles at VMware and IBM. His experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production. He's received several excellence awards and holds over 40 issued patents and 25 publications in key systems conferences such as the International Conference on Very Large Data Bases (VLDB), Special Interest Group on Management of Data (SIGMOD), Conference on Innovative Data Systems Research (CIDR), and USENIX. He's a regular speaker at academic institutions, guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He also advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He holds a PhD in computer science from the University of Illinois Urbana-Champaign.