Adoption of data pipelines built on Hadoop and massively parallel processing (MPP) databases such as Vertica and Redshift has risen sharply, but automated testing in these pipelines and in other big data projects has lagged behind. To a large extent, the business logic is implemented in SQL scripts, and quality checks on those scripts have so far been a manual process. Unit testing is nonexistent, and quality metrics such as code coverage are not clearly defined for SQL scripts. A further obstacle to automated testing is that most data engineers and analysts are more comfortable with SQL than with languages such as Java or Python, which have established testing standards. If you are building a data pipeline, you should bake in these engineering best practices to ensure that it delivers the business impact it was built for.
Avinash Padmanabhan describes how his team at Intuit is changing the way it builds and tests extract-transform-load (ETL) jobs. Avinash presents an automation solution that both data and quality engineers can use to build quality into the data pipeline. He explains how to use Docker to virtualize end-to-end data infrastructure pipelines inside local development environments with low overhead and faster feedback, so that problems are fixed early in development rather than late in the QA stage or, worse, in production.
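As a rough illustration of what a unit test for SQL business logic can look like, here is a minimal sketch in Python. It is not the solution presented in the talk: it assumes an in-memory SQLite database as a lightweight stand-in for a Docker-hosted pipeline database, and the `orders` table, the `customer_revenue` table, and the `transform_orders` helper are all hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical ETL step under test: total paid amount per customer.
TRANSFORM_SQL = """
CREATE TABLE customer_revenue AS
SELECT customer_id, SUM(amount_cents) AS total_cents
FROM orders
WHERE status = 'paid'
GROUP BY customer_id
"""


def transform_orders(input_rows):
    """Load fixture rows, run the SQL transform, and return its output rows."""
    # In-memory SQLite is an assumption made to keep the sketch self-contained;
    # a real setup might point at a database running in a local container.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders ("
            "customer_id INTEGER, amount_cents INTEGER, status TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", input_rows)
        conn.execute(TRANSFORM_SQL)
        return conn.execute(
            "SELECT customer_id, total_cents "
            "FROM customer_revenue ORDER BY customer_id"
        ).fetchall()
    finally:
        conn.close()


def test_only_paid_orders_are_summed():
    fixture = [
        (1, 1000, "paid"),
        (1, 2000, "paid"),
        (1, 9900, "refunded"),   # must be excluded by the WHERE clause
        (2, 500, "paid"),
        (3, 700, "cancelled"),   # customer 3 has no paid orders at all
    ]
    assert transform_orders(fixture) == [(1, 3000), (2, 500)]


if __name__ == "__main__":
    test_only_paid_orders_are_summed()
    print("ok")
```

The design choice here is to keep the SQL as the unit under test and to use Python only for fixtures and assertions, so the script can run in seconds on a developer machine before it reaches QA.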
Avinash Padmanabhan is a staff quality engineer in Intuit’s Small Business Data and Analytics group, where he focuses on ensuring quality of the data pipeline that enables the work of analysts and business stakeholders. Avinash has over 12 years of experience specializing in building frameworks and solutions that solve challenging quality problems and delight customers. He holds a master’s degree in electrical and computer engineering from the State University of New York.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.