Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science

Jackson Brown (Allen Institute for Cell Science), Aneesh Karve (Quilt)
11:05am–11:45am Friday, August 24, 2018

Who is this presentation for?

  • Data scientists, data engineers, and researchers

Prerequisite knowledge

  • A working knowledge of Python
  • A basic understanding of the command line (useful but not required)

What you'll learn

  • Learn how to version, distribute, and reproduce large datasets for modeling and analysis; version and rehydrate machine learning models; and package, tag, document, and update datasets in a collaborative environment
  • Gain insights into the goals and practice of open science at the Allen Institute for Cell Science

Description

Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.

The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for data scientists and external collaborators, the institute sought a platform that would enable plain-text search, subsetting of large datasets, version control to support reproducible experiments, and easy accessibility from data science tools like Jupyter, Python, and pandas. The team discovered that software optimized for storing and versioning source code (e.g., GitHub) exhibits slow performance for large files and places hard limits on file size that preclude large data repositories altogether. In response, the team is creating an open repository of image data that is enriched with metadata and encapsulated in “data packages”—versioned, immutable sets of data dependencies.
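One way to picture a versioned, immutable set of data dependencies is a manifest that maps each file to a hash of its contents, with the package version derived from the manifest itself. The sketch below is illustrative only: it uses the Python standard library rather than Quilt's actual API, and the file names, byte strings, and helper functions (`build_manifest`, `package_version`) are hypothetical.

```python
import hashlib
import json


def build_manifest(files):
    """Map each logical file name to the SHA-256 digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(files.items())
    }


def package_version(manifest):
    """Derive a version identifier from the manifest contents."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical in-memory stand-ins for an image file and its metadata.
v1 = build_manifest({
    "cell_001.tiff": b"pixels-a",
    "metadata.json": b'{"line": "A"}',
})
# The same package after one image changes.
v2 = build_manifest({
    "cell_001.tiff": b"pixels-b",
    "metadata.json": b'{"line": "A"}',
})

# Any change to any file yields a different package version,
# so a notebook that pins a version always sees the same bytes.
print(package_version(v1) == package_version(v2))
```

Because the version is a function of the contents, two collaborators who resolve the same identifier are, under these assumptions, guaranteed to be working with identical data.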

The concept of package management is well known in software development. To date, however, package management has largely been applied to source code. Jackson and Aneesh propose to extend package management to the unique file size and format challenges of data by building on top of Quilt, an open source data registry. In combination with custom filtering software, Quilt enables efficient search and query of metadata so that data scientists can filter terabyte-size packages into megabyte-size subsets that fit on a single machine. The package management infrastructure optimizes not only storage and network transfer but also serialization and virtualization. As a result, data scientists can interact with data packages in formats that are native to Jupyter and Python. Jackson and Aneesh also explore the role of data packages in versioning models and detecting model drift using “data unit tests” that check data profiles.
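The two ideas in this paragraph, filtering a package by its metadata and running a "data unit test" against a data profile, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' filtering software or Quilt's API: the metadata fields (`structure`, `size_mb`), the helper names, and the 10% tolerance are all hypothetical.

```python
from statistics import mean

# Hypothetical metadata records describing files in a data package.
records = [
    {"file": "cell_001.tiff", "structure": "nucleus", "size_mb": 42.0},
    {"file": "cell_002.tiff", "structure": "membrane", "size_mb": 38.5},
    {"file": "cell_003.tiff", "structure": "nucleus", "size_mb": 40.5},
]


def select(records, **criteria):
    """Keep only records whose metadata matches every criterion."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


def profile(records, field):
    """Summarize one numeric field as a simple data profile."""
    values = [r[field] for r in records]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def mean_has_drifted(current, baseline, tolerance=0.10):
    """Flag drift when the mean moves more than `tolerance` (relative)."""
    return abs(current["mean"] - baseline["mean"]) > tolerance * abs(baseline["mean"])


# Filter the package down to a subset using metadata alone.
nuclei = select(records, structure="nucleus")
baseline = profile(nuclei, "size_mb")

# A hypothetical later batch, checked against the stored baseline profile.
new_batch = [
    {"file": "cell_101.tiff", "structure": "nucleus", "size_mb": 61.0},
    {"file": "cell_102.tiff", "structure": "nucleus", "size_mb": 59.0},
]
print(mean_has_drifted(profile(new_batch, "size_mb"), baseline))
```

In this sketch, only the metadata is scanned during filtering, so the large image files would need to be fetched only for the records that survive the filter.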

Jackson Brown

Allen Institute for Cell Science

Jackson Brown is a research engineer working on data release infrastructure for the modeling team at the Allen Institute for Cell Science. He is also the cofounder of the Council Data Project, an organization working to enable better public transparency and discourse. Previously, he was a designer for SageMathCloud (CoCalc), a collaborative computation service.

Aneesh Karve

Quilt

Aneesh Karve is the CTO of Quilt Data, a Y Combinator company advancing an open source standard for versioned data. Previously, Aneesh was a product manager, lead designer, and software engineer at companies including Microsoft, NVIDIA, and Matterport, and the general manager and a founding member of AdJitsu, the first real-time 3D advertising platform for iOS (acquired by Amobee in 2012). He holds degrees in chemistry, mathematics, and computer science. Aneesh’s research background spans proteomics, machine learning, and algebraic number theory.

Comments

Aneesh Karve | COFOUNDER AND CTO
08/25/2018 8:36pm EDT

Here are the slides. I had also posted them to the portal, and I’m not sure why they didn’t show up.

https://github.com/quiltdata/jupytercon

Philipp Kats | DATA SCIENTIST
08/25/2018 7:51am EDT

Great presentation! Is there any chance of getting the slide deck?
Thanks in advance!