Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Pipeline testing with Great Expectations

Abe Gong (Superconductive Health), James Campbell (USG)
5:10pm–5:50pm Wednesday, March 7, 2018
Secondary topics: Data Integration and Data Pipelines
Average rating: 5.00 out of 5 (4 ratings)

Who is this presentation for?

  • Data analysts, engineers, and scientists and anyone who manages data teams

Prerequisite knowledge

  • Familiarity with data pipelines and Python (useful but not required)

What you'll learn

  • Explore Great Expectations, an open source Python framework for bringing data pipelines and products under test

Description

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Without automated tests, data pipelines often become deep stacks of unverified assumptions. Mysterious (and sometimes embarrassing) bugs crop up more and more frequently, and resolving them requires painstaking exploration of upstream data, often leading to frustrating negotiations about data specs across teams.

It’s not unusual to see data teams grind to a halt for weeks (or even months) to pay down accumulated pipeline debt. Servicing pipeline debt is one of the biggest productivity and morale killers on data teams. This work is never fun—after all, it’s just data cleaning: no new products shipped, no new insights kindled. Even worse, it’s recleaning old data that you thought you’d already dealt with.

Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Like assertions in traditional Python unit tests, expectations provide a flexible, declarative language for describing expected behavior. Unlike traditional unit tests, Great Expectations applies expectations to data instead of code. Great Expectations makes it easy to set up your testing framework early, capture findings while they’re still fresh, and systematically validate new data against them. It’s the best tool for managing the complexity that inevitably grows within data pipelines.
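
The idea of applying expectations to data rather than code can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Great Expectations API: the function names, the result dictionaries, and the sample records (patient_id, age) are all hypothetical, loosely modeled on the declarative style described above.

```python
# Hypothetical sketch of "expectations on data". This is NOT the Great
# Expectations API; every name and record below is invented for illustration.


def expect_column_values_to_not_be_null(rows, column):
    """Expect every row to carry a non-None value in `column`."""
    unexpected = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": "not_null",
        "column": column,
        "success": not unexpected,
        "unexpected_rows": unexpected,
    }


def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Expect every non-null value in `column` to lie in [min_value, max_value]."""
    unexpected = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None
        and not (min_value <= row[column] <= max_value)
    ]
    return {
        "expectation": "between",
        "column": column,
        "success": not unexpected,
        "unexpected_rows": unexpected,
    }


def validate(rows, expectations):
    """Run each (function, kwargs) pair against the data and collect the results."""
    return [func(rows, **kwargs) for func, kwargs in expectations]


# Expectations are declared once, while findings about the data are fresh ...
suite = [
    (expect_column_values_to_not_be_null, {"column": "patient_id"}),
    (
        expect_column_values_to_be_between,
        {"column": "age", "min_value": 0, "max_value": 120},
    ),
]

# ... and then applied to every new batch of (made-up) data.
new_batch = [
    {"patient_id": "a1", "age": 34},
    {"patient_id": None, "age": 51},
    {"patient_id": "c3", "age": 430},
]

results = validate(new_batch, suite)
for result in results:
    print(result["expectation"], result["column"], result["success"], result["unexpected_rows"])
```

In this sketch both expectations fail on the sample batch: the null check flags row 1 and the range check flags row 2. Because the suite is plain data (a list of checks and their parameters), the same declarations can be rerun unchanged whenever an upstream source delivers a new batch.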

Abe Gong

Superconductive Health

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

James Campbell

USG

James Campbell is a senior data scientist and researcher at the Laboratory for Analytical Sciences (LAS), a collaborative public-private research and development organization housed at NC State University. His current work focuses on measuring and enhancing analytic quality by weaving together traditional, human-centric analytic processes with predictive, model-driven analytic tools. He is one of the core contributors to the Great Expectations project. James has worked in government for more than a decade, leading significant data science tradecraft development efforts. He has managed multiple data science teams tackling a wide range of topics, including counterterrorism and information operations. His prior analytical experience includes strategic cyberthreat intelligence research and economic analysis for litigation. James holds a bachelor’s degree in math and philosophy from Yale and a master’s degree in security studies from Georgetown. James lives in Cary, North Carolina, with his wife, two daughters, and dog. He speaks Russian, enjoys running and cycling, and designs mathematical sculpture.