Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Shifting left for continuous quality in an Agile data world

1:50pm–2:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A
Level: Intermediate
Secondary topics: Architecture, Data Platform, Financial services
Average rating: ***** (5.00, 2 ratings)

Who is this presentation for?

  • Beginner or intermediate data and quality engineers

Prerequisite knowledge

  • An understanding of how a typical data pipeline works
  • Familiarity with the engineering processes that are involved in extracting, transforming, and loading data

What you'll learn

  • Learn Agile best practices for data projects
  • Understand how to enable developers and quality engineers to inject quality into the data pipeline much earlier in the process
  • Explore how to virtualize the end-to-end data pipeline with Docker in a way that adds little overhead, greatly simplifies local testing, and enables faster feedback

Description

Adoption of data pipelines based on Hadoop and massively parallel processing (MPP) databases like Vertica and Redshift has risen exponentially, but the journey toward automated testing in these pipelines and other big data projects has been rough. Business logic is implemented largely in SQL scripts, and quality checks on those scripts have so far been a manual process. Unit testing is nonexistent, and other engineering metrics, such as code coverage for SQL scripts, are not clearly defined. Another obstacle is that most data engineers and analysts are more comfortable with SQL than with languages like Java or Python, which have established testing standards. If you are building a data pipeline, you should be baking these engineering best practices in to ensure optimal business impact.
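As a concrete illustration of what unit testing SQL business logic can look like (a minimal sketch, not taken from the talk), a transformation can be exercised against a disposable in-memory database with a small, known fixture. The query, table, and column names below are invented for illustration; the example uses only Python's standard-library sqlite3 and runs under pytest.

    # test_revenue_rollup.py -- a minimal sketch of unit testing SQL
    # business logic. The schema and query are hypothetical; any SQL
    # transformation could be checked the same way against a throwaway
    # in-memory database.
    import sqlite3

    # The SQL under test: roll per-order amounts up to per-customer revenue.
    REVENUE_ROLLUP_SQL = """
        SELECT customer_id, SUM(amount) AS total_revenue
        FROM orders
        GROUP BY customer_id
    """

    def make_db_with_fixture():
        """Create an in-memory database seeded with a tiny, known dataset."""
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [("c1", 10.0), ("c1", 5.0), ("c2", 7.5)],
        )
        return conn

    def test_revenue_rollup_sums_per_customer():
        conn = make_db_with_fixture()
        rows = dict(conn.execute(REVENUE_ROLLUP_SQL).fetchall())
        # The assertion pins down the business rule: one total per customer.
        assert rows == {"c1": 15.0, "c2": 7.5}

Because the fixture is created fresh for every run, such tests give fast, repeatable feedback long before the script reaches a shared QA environment.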

Avinash Padmanabhan describes how his team at Intuit is changing the way it builds and tests extract-transform-load (ETL) jobs. Avinash presents an automation solution that both data and quality engineers can use to build quality into the data pipeline, explaining how to use Docker to virtualize end-to-end data infrastructure pipelines inside local development environments with low overhead and fast feedback. This allows problems to be fixed early in development rather than late in the QA stage or, worse, in production.
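One way to realize this kind of Docker-based local testing (an illustrative sketch, not necessarily the setup described in the talk) is to start a throwaway database container from inside the test suite itself. The example below assumes the third-party testcontainers and SQLAlchemy packages plus a local Docker daemon; the tables and the staging-load query are hypothetical.

    # test_pipeline_in_docker.py -- a sketch of exercising an ETL step
    # against a real Postgres instance running in Docker. Assumes the
    # `testcontainers` and `sqlalchemy` packages; all table names are
    # hypothetical.
    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    def test_staging_load_runs_against_real_postgres():
        # Start a throwaway Postgres container; it is removed when the
        # `with` block exits, so every run starts from a clean slate.
        with PostgresContainer("postgres:13") as pg:
            engine = sqlalchemy.create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                # Seed the source table with a small, known fixture.
                conn.execute(sqlalchemy.text(
                    "CREATE TABLE raw_events (user_id TEXT, clicks INT)"))
                conn.execute(sqlalchemy.text(
                    "INSERT INTO raw_events VALUES ('u1', 3), ('u1', 2)"))
                # Run the transformation under test (a stand-in ETL step).
                conn.execute(sqlalchemy.text(
                    "CREATE TABLE user_clicks AS "
                    "SELECT user_id, SUM(clicks) AS total "
                    "FROM raw_events GROUP BY user_id"))
                total = conn.execute(sqlalchemy.text(
                    "SELECT total FROM user_clicks WHERE user_id = 'u1'"
                )).scalar()
            assert total == 5

Because the container is created and destroyed per test run, the pipeline is exercised against the same database engine used in production without any shared environment to provision or clean up.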

Avinash Padmanabhan

Intuit

Avinash Padmanabhan is a staff quality engineer in Intuit's Small Business Data and Analytics group, where he focuses on ensuring the quality of the data pipeline that enables the work of analysts and business stakeholders. Avinash has over 12 years of experience specializing in building frameworks and solutions that solve challenging quality problems and delight customers. He holds a master's degree in electrical and computer engineering from the State University of New York.