Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.
The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for data scientists and external collaborators, the institute sought a platform that would enable plain-text search, subsetting of large datasets, version control to support reproducible experiments, and easy accessibility from data science tools like Jupyter, Python, and pandas. The team discovered that software optimized for storing and versioning source code (e.g., GitHub) exhibits slow performance for large files and places hard limits on file size that preclude large data repositories altogether. In response, the team is creating an open repository of image data that is enriched with metadata and encapsulated in “data packages”—versioned, immutable sets of data dependencies.
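One way to picture a versioned, immutable data package is as a manifest of content hashes whose own hash serves as the version. The sketch below is illustrative only and uses the Python standard library; it is not Quilt's implementation, and the function names `build_manifest` and `package_version`, the file names, and the sample bytes are all hypothetical.

```python
import hashlib
import json


def build_manifest(files):
    """Map each logical file name to the SHA-256 digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(files.items())
    }


def package_version(manifest):
    """Derive a version identifier from the manifest contents alone."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical package contents: tiny byte strings stand in for large images.
files_v1 = {
    "cells/img_001.tiff": b"pixels-a",
    "metadata.csv": b"id,label\n1,nucleus\n",
}
files_v2 = dict(files_v1, **{"cells/img_001.tiff": b"pixels-b"})

version_1 = package_version(build_manifest(files_v1))
version_1_again = package_version(build_manifest(dict(files_v1)))
version_2 = package_version(build_manifest(files_v2))
```

Because the version is derived from the bytes, identical contents always map to the same identifier and any change to a file yields a new one, which is the property a notebook needs in order to pin an exact data dependency.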
The concept of package management is well known in software development. To date, however, package management has largely been applied to source code. Jackson and Aneesh propose to extend package management to the unique file size and format challenges of data by building on top of Quilt, an open source data registry. In combination with custom filtering software, Quilt enables efficient search and query of metadata so that data scientists can filter terabyte-size packages into megabyte-size subsets that fit on a single machine. The package management infrastructure optimizes not only storage and network transfer but also serialization and virtualization. As a result, data scientists can interact with data packages in formats that are native to Jupyter and Python. Jackson and Aneesh also explore the role of data packages in versioning models and detecting model drift using “data unit tests” that check data profiles.
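A "data unit test" of the kind mentioned above might compare summary statistics of a new package version against a stored baseline. The following is a minimal sketch under stated assumptions, not the speakers' method: the helpers `profile` and `check_profile`, the 10% relative tolerance, and the sample values are all hypothetical.

```python
import statistics


def profile(values):
    """Summarize a numeric column with a few simple statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def check_profile(baseline, candidate, rel_tol=0.1):
    """Return the statistics that drift more than rel_tol from the baseline."""
    failures = []
    for key in ("mean", "stdev"):
        base = baseline[key]
        # Guard against dividing by zero when the baseline statistic is 0.
        scale = abs(base) if base != 0 else 1.0
        if abs(candidate[key] - base) / scale > rel_tol:
            failures.append(key)
    return failures


# Hypothetical measurements from three versions of the same dataset.
baseline = profile([10.0, 12.0, 11.0, 13.0, 9.0])
similar = profile([10.0, 12.5, 11.0, 13.0, 9.0])
shifted = profile([20.0, 22.0, 21.0, 23.0, 19.0])

similar_failures = check_profile(baseline, similar)
shifted_failures = check_profile(baseline, shifted)
```

Run against each new package version, a check like this flags a shift in the input distribution before a model is retrained or evaluated on it. The tolerance and the set of statistics would need tuning for real image-derived features.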
Jackson Brown is a research engineer working on data release infrastructure for the modeling team at the Allen Institute for Cell Science. He is also the cofounder of the Council Data Project, an organization working to enable better public transparency and discourse. Previously, he was a designer for SageMathCloud (CoCalc), a collaborative computation service.
Aneesh Karve is the CTO of Quilt Data, a Y Combinator company advancing an open source standard for versioned data. Previously, Aneesh was a product manager, lead designer, and software engineer at companies including Microsoft, NVIDIA, and Matterport, as well as the general manager and a founding member of AdJitsu, the first real-time 3D advertising platform for iOS (acquired by Amobee in 2012). He holds degrees in chemistry, mathematics, and computer science. Aneesh’s research background spans proteomics, machine learning, and algebraic number theory.