Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Without automated tests, data pipelines often become deep stacks of unverified assumptions. Mysterious (and sometimes embarrassing) bugs crop up more and more frequently, and resolving them requires painstaking exploration of upstream data, often leading to frustrating negotiations about data specs across teams.
It’s not unusual to see data teams grind to a halt for weeks (or even months) to pay down accumulated pipeline debt. Servicing pipeline debt is one of the biggest productivity and morale killers on data teams. This work is never fun—after all, it’s just data cleaning: no new products shipped, no new insights kindled. Even worse, it’s recleaning old data that you thought you’d already dealt with.
Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Like assertions in traditional Python unit tests, expectations provide a flexible, declarative language for describing expected behavior. Unlike traditional unit tests, Great Expectations applies expectations to data instead of code. Great Expectations makes it easy to set up your testing framework early, capture findings while they’re still fresh, and systematically validate new data against them. It is built to help manage the complexity that inevitably grows within data pipelines.
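To make the idea of applying expectations to data rather than code more concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API: the helpers `expect_not_null`, `expect_between`, and `validate`, along with the column names and sample rows, are hypothetical and exist only for illustration. The sketch declares a small suite of expectations once and then checks two batches of data against it.

```python
# Hypothetical, hand-rolled illustration of "expectations" on data.
# This is NOT the Great Expectations API; all names here are assumptions.

def expect_not_null(rows, column):
    """Expect every row to have a non-missing value in `column`."""
    unexpected = [r for r in rows if r.get(column) is None]
    return {
        "expectation": "not_null",
        "column": column,
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }


def expect_between(rows, column, min_value, max_value):
    """Expect every non-missing value in `column` to fall in [min_value, max_value]."""
    unexpected = [
        r for r in rows
        if r.get(column) is not None
        and not (min_value <= r[column] <= max_value)
    ]
    return {
        "expectation": "between",
        "column": column,
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }


def validate(rows, suite):
    """Run each (check, kwargs) pair in the suite against a batch of rows."""
    return [check(rows, **kwargs) for check, kwargs in suite]


# The suite is declared once, as data, and reused for every new batch.
suite = [
    (expect_not_null, {"column": "user_id"}),
    (expect_between, {"column": "age", "min_value": 0, "max_value": 120}),
]

good_batch = [{"user_id": 1, "age": 34}, {"user_id": 2, "age": 51}]
bad_batch = [{"user_id": 3, "age": 29}, {"user_id": None, "age": 430}]

good_results = validate(good_batch, suite)
bad_results = validate(bad_batch, suite)

for result in bad_results:
    print(result)
```

Keeping the suite separate from the checking logic is what lets the same assumptions be recorded once and then validated against each new batch as it arrives, which is the workflow the abstract describes.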
Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.
James Campbell is a senior data scientist and researcher at the Laboratory for Analytical Sciences (LAS), a collaborative public-private research and development organization housed at NC State University. His current work focuses on measuring and enhancing analytic quality by weaving together traditional, human-centric analytic processes with predictive, model-driven analytic tools. He is one of the core contributors to the Great Expectations project. James has worked in government for more than a decade, leading significant data science tradecraft development efforts. He has managed multiple data science teams tackling a wide range of topics, including counterterrorism and information operations. His prior analytical experience includes strategic cyberthreat intelligence research and economic analysis for litigation. James holds a bachelor’s degree in math and philosophy from Yale and a master’s degree in security studies from Georgetown. James lives in Cary, North Carolina, with his wife, two daughters, and dog. He speaks Russian, enjoys running and cycling, and designs mathematical sculpture.
©2018, O'Reilly Media, Inc.