Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Circuit breakers to safeguard for garbage in, garbage out

5:25pm–6:05pm Wednesday, 09/12/2018
Secondary topics: Data Integration and Data Pipelines, Financial Services

Who is this presentation for?

  • DevOps engineers, big data engineers, and data architects

Prerequisite knowledge

  • A basic understanding of big data technologies and data pipelines

What you'll learn

  • Explore a circuit breaker pattern for building checks in your data pipeline to ensure reliable insights are generated for data analysts and data scientists

Description

Do your analysts always trust the insights generated by your data platform? Faced with an unexpected insight, does your analyst team spend time verifying data quality, ETL correctness, and job dependencies? As financial use cases increasingly incorporate social feeds, these verifications become extremely complex and do not scale given the volume, velocity, and variety of the data. The circuit breaker is a common design pattern used by software developers to ensure graceful handling of errors in a service-oriented architecture. Taking inspiration from this pattern, Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines that detects and corrects problems and ensures reliable insights.

The process of converting data into insights involves a multistage pipeline with ingestion, cleansing, transformations, and analytical operations. Each stage implements a circuit breaker that continuously analyzes metrics and correctness rules. If any of these are violated, the circuit is broken, and processing does not progress to the next stage in the pipeline. The checks are a collection of runtime analyses of data quality, job health, and operational error logs from the analytical engines and data stores, implemented as a combination of domain-knowledge rules and machine learning for anomaly detection. Depending on the type of error, the circuit breaker framework attempts to either repair and reschedule the job or cancel it with a user notification. Sandeep explains how this pattern was developed and how it is applied.
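The staged flow described above can be sketched in code. This is a minimal illustration, not the speaker's implementation: the stage names, check rules, thresholds, and the `StageBreaker`/`run_pipeline` names are all hypothetical, and real checks would draw on live metrics, error logs, and anomaly models rather than a static dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of a per-stage circuit breaker: each stage runs its
# checks against current metrics; if any check fails, the circuit "trips"
# and processing does not advance to the next stage.

@dataclass
class StageBreaker:
    name: str
    checks: List[Callable[[Dict[str, float]], bool]]  # True = check passed
    repairable: bool = False  # can a tripped stage be repaired and rescheduled?

    def closed(self, metrics: Dict[str, float]) -> bool:
        """Circuit stays closed only if every check passes."""
        return all(check(metrics) for check in self.checks)

def run_pipeline(stages: List[StageBreaker], metrics: Dict[str, float]) -> str:
    """Advance stage by stage; stop at the first tripped breaker."""
    for stage in stages:
        if not stage.closed(metrics):
            if stage.repairable:
                return f"{stage.name}: tripped, repair and reschedule job"
            return f"{stage.name}: tripped, cancel job and notify user"
    return "completed: insights published"

# Example checks: simple data-quality and job-health rules (illustrative
# thresholds; a real framework would combine rules with anomaly detection).
stages = [
    StageBreaker("ingestion", [lambda m: m["row_count"] > 0]),
    StageBreaker("cleansing", [lambda m: m["null_ratio"] < 0.05], repairable=True),
    StageBreaker("transform", [lambda m: m["job_errors"] == 0]),
]

print(run_pipeline(stages, {"row_count": 1000, "null_ratio": 0.01, "job_errors": 0}))
# completed: insights published
```

A dirty batch (for example, `null_ratio` of 0.2) trips the cleansing breaker, which is marked repairable, so the framework would reschedule the job; an empty ingest trips the non-repairable ingestion breaker and cancels with a notification.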

Photo of Sandeep Uttamchandani

Sandeep Uttamchandani

Intuit

Sandeep Uttamchandani is a chief data architect at Intuit, where he leads the cloud transformation of the big data analytics, ML, and transactional platform used by 4M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep was cofounder and CEO of a machine learning startup focused on ML for managing enterprise systems and held various engineering roles at VMware and IBM. His experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production. He has received several excellence awards and holds over 40 issued patents and 25 publications in key systems conferences such as the International Conference on Very Large Data Bases (VLDB), Special Interest Group on Management of Data (SIGMOD), Conference on Innovative Data Systems Research (CIDR), and USENIX. He is a regular speaker at academic institutions, guest lectures for university courses, and conducts conference tutorials for data engineers and scientists; he also advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He holds a PhD in computer science from the University of Illinois Urbana-Champaign.

Comments on this page are now closed.

Comments

Klyment Mamykin | VP DATA ENGINEERING
09/17/2018 5:48pm EDT

Hi, I am looking for the slides of this presentation. Are they available anywhere?