Time travel for data pipelines: Solving the mystery of what changed
Who is this presentation for?
Data engineers and architects
Debugging real-world data pipelines is highly non-trivial: finding the root cause of incorrect insights can take hours or even days. The first step in the debugging process is understanding what changed. There are several moving parts:
ETL code changes
Ingestion configuration changes
Source database configuration changes
Application schema or data logic changes
Dependencies of derived tables produced by other pipelines
Data Lake or Warehouse config changes
Data quality changes
Job scheduling errors
Data Fabric failures
Cross-DB data integrity issues
…and so on…
Today, finding these changes is a trial-and-error process that requires highly trained data engineers to work together with database engineers, data analysts, and data scientists. As we democratize the data platform, analysts and data scientists need an automated framework that tracks and flags every change that directly or indirectly impacts their pipeline.
At Intuit, we have implemented this capability in SuperGlue, Intuit’s homegrown data lineage framework. SuperGlue automatically discovers data pipeline lineage. For each entity in the lineage, it tracks configuration and execution stats related to transactional sources, the big data fabric, data cardinality, jobs, and business events. By combining this information and applying anomaly detection, SuperGlue helps debug data pipeline issues in minutes. This talk describes the design and implementation in detail, as well as our phased execution approach and the adoption of this self-serve capability across different personas. Moving forward, we are working on proactive alerts that are raised as soon as configuration or execution changes impacting the data pipelines are detected.
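The general idea of combining lineage with change tracking and anomaly detection can be illustrated with a minimal Python sketch. This is not SuperGlue's implementation, which the abstract does not describe at this level; the lineage map, entity names, `ChangeEvent` type, and the z-score threshold are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, stdev

# Hypothetical lineage: each entity maps to the entities it reads from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}


@dataclass(frozen=True)
class ChangeEvent:
    """A recorded change (hypothetical shape): which entity, what kind, when."""
    entity: str
    kind: str
    at: datetime


def upstream_entities(target, upstream):
    """Return the target plus everything it depends on, directly or indirectly."""
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def changes_affecting(target, upstream, events, since):
    """List changes on the target's lineage at or after `since`, oldest first."""
    scope = upstream_entities(target, upstream)
    hits = [e for e in events if e.entity in scope and e.at >= since]
    return sorted(hits, key=lambda e: e.at)


def is_anomalous(history, latest, threshold=3.0):
    """Simple z-score check: is `latest` far from the mean of `history`?"""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > threshold


# Illustrative data only.
events = [
    ChangeEvent("orders_raw", "schema change", datetime(2019, 9, 2, 8, 0)),
    ChangeEvent("fx_rates", "job rescheduled", datetime(2019, 8, 20, 1, 0)),
    ChangeEvent("marketing_leads", "ETL code change", datetime(2019, 9, 2, 9, 0)),
]
suspects = changes_affecting(
    "revenue_dashboard", UPSTREAM, events, since=datetime(2019, 9, 1)
)
for event in suspects:
    print(event.entity, event.kind, event.at.isoformat())
```

In this sketch, only the `orders_raw` schema change is reported for `revenue_dashboard`: the `fx_rates` change predates the window and `marketing_leads` is not on the dashboard's lineage. The `is_anomalous` helper shows one simple way a tracked execution stat, such as a daily row count, could be checked against its recent history.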
Prerequisite knowledge
A basic understanding of big data platforms and data pipelines
What you'll learn
Shradha Ambekar is a staff software engineer with the Small Business Data Group at Intuit. She has experience working with Hadoop, Spark, Kafka, Cassandra, and Vertica. She is the technical lead for the Lineage Framework (SuperGlue) and Real-Time analytics, and has made several key contributions to building data platform solutions at Intuit. She is a contributor to spark-cassandra-connector. She is a speaker at Open Source O’Reilly Conference 2019. Prior to joining Intuit, she worked as a software engineer with Rearden Commerce. She has a bachelor’s degree in Electronics and Communication Engineering from NIT Raipur, India.
Sunil Goplani is a Group Development Manager at Intuit. Sunil has played key architecture and leadership roles building solutions around data platforms, Big Data, BI, Data Warehousing, and MDM for startups and enterprises. Sunil is currently leading the Big Data platform at Intuit. Prior to Intuit, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a Master’s degree in Computer Science.
Sandeep Uttamchandani is the hands-on Chief Data Architect at Intuit. He is currently leading the Cloud transformation of the Big Data Analytics, ML, and Transactional platform used by 4M+ Small Business Users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep held various engineering roles at VMware and IBM, and founded a startup focused on ML for managing Enterprise systems. Sandeep’s experience combines building Enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s Federal and Fortune 100 customers. Sandeep has received several excellence awards and has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, gives guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a Program Committee Member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. Sandeep holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign.