Brought to you by NumFOCUS Foundation and O’Reilly Media

The official Jupyter Conference

Aug 21-22, 2018: Training

Aug 22-24, 2018: Tutorials & Conference

New York, NY

Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science

Jackson Brown (Allen Institute for Cell Science), Aneesh Karve (Quilt)

11:05am–11:45am Friday, August 24, 2018

Enterprise and organizational adoption, Extensions and customization, Reproducible research and open science
Location: Sutton Center/Sutton South
Level: Intermediate

Who is this presentation for?

Data scientists, data engineers, and researchers

Prerequisite knowledge

A working knowledge of Python
A basic understanding of the command line (useful but not required)

What you'll learn

Learn how to version, distribute, and reproduce large datasets for modeling and analysis; version and rehydrate machine learning models; and package, tag, document, and update datasets in a collaborative environment
Gain insights into the goals and practice of open science at the Allen Institute for Cell Science

Description

Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.

The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for data scientists and external collaborators, the institute sought a platform that would enable plain-text search, subsetting of large datasets, version control to support reproducible experiments, and easy accessibility from data science tools like Jupyter, Python, and pandas. The team discovered that software optimized for storing and versioning source code (e.g., GitHub) exhibits slow performance for large files and places hard limits on file size that preclude large data repositories altogether. In response, the team is creating an open repository of image data that is enriched with metadata and encapsulated in “data packages”—versioned, immutable sets of data dependencies.
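The idea of a versioned, immutable data package can be illustrated with a minimal sketch. It uses only the Python standard library and does not reflect Quilt's actual API or storage format; the function names, file paths, and metadata fields are hypothetical. The assumption here is that a package is described by a manifest of content hashes, and that the package version is derived from the manifest itself, so any change to the data produces a new version.

```python
import hashlib
import json


def file_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files: dict, metadata: dict) -> dict:
    """Map each logical path to its content hash and its metadata record."""
    entries = {
        path: {"sha256": file_hash(content), "meta": metadata.get(path, {})}
        for path, content in files.items()
    }
    return {"entries": entries}


def package_version(manifest: dict) -> str:
    """Derive a version ID by hashing the canonical JSON form of the manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical stand-ins for microscopy images and their metadata.
files_v1 = {
    "cells/img_001.tiff": b"pixels-a",
    "cells/img_002.tiff": b"pixels-b",
}
meta = {
    "cells/img_001.tiff": {"structure": "nucleus"},
    "cells/img_002.tiff": {"structure": "membrane"},
}

v1 = package_version(build_manifest(files_v1, meta))

# Reprocessing one image changes its hash, and therefore the package version.
files_v2 = dict(files_v1)
files_v2["cells/img_002.tiff"] = b"pixels-b-reprocessed"
v2 = package_version(build_manifest(files_v2, meta))

print(v1 != v2)
```

Because the version is a function of the content, a notebook that pins a version ID always refers to the same bytes, which is the property that makes a data dependency reproducible.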

The concept of package management is well known in software development. To date, however, package management has largely been applied to source code. Jackson and Aneesh propose to extend package management to the unique file size and format challenges of data by building on top of Quilt, an open source data registry. In combination with custom filtering software, Quilt enables efficient search and query of metadata so that data scientists can filter terabyte-size packages into megabyte-size subsets that fit on a single machine. The package management infrastructure optimizes not only storage and network transfer but also serialization and virtualization. As a result, data scientists can interact with data packages in formats that are native to Jupyter and Python. Jackson and Aneesh also explore the role of data packages in versioning models and detecting model drift using “data unit tests” that check data profiles.
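Two of the ideas above, filtering a package by its metadata and checking a data profile in a "data unit test", can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the speakers' filtering software or Quilt's API: the index records, field names, and the 10% tolerance are all hypothetical.

```python
from statistics import mean

# Hypothetical metadata index: one small record per image, searchable
# without downloading the image files themselves.
index = [
    {"path": "img_001.tiff", "structure": "nucleus", "mean_intensity": 0.40},
    {"path": "img_002.tiff", "structure": "membrane", "mean_intensity": 0.70},
    {"path": "img_003.tiff", "structure": "nucleus", "mean_intensity": 0.44},
    {"path": "img_004.tiff", "structure": "nucleus", "mean_intensity": 0.42},
]


def filter_entries(entries, **criteria):
    """Keep only the entries whose metadata matches every key=value criterion."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def profile(entries, field):
    """Summarize one numeric metadata field across a set of entries."""
    values = [e[field] for e in entries]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def check_profile(current, baseline, rel_tol=0.10):
    """Return the statistics that moved more than rel_tol away from the baseline."""
    drifted = []
    for name in ("min", "max", "mean"):
        allowed = rel_tol * abs(baseline[name])
        if abs(current[name] - baseline[name]) > allowed:
            drifted.append(name)
    return drifted


# Subset the package by metadata, then record a baseline profile for the subset.
nuclei = filter_entries(index, structure="nucleus")
baseline = profile(nuclei, "mean_intensity")

# A hypothetical later batch whose intensities have shifted upward.
new_batch = [
    {"path": "img_101.tiff", "structure": "nucleus", "mean_intensity": 0.60},
    {"path": "img_102.tiff", "structure": "nucleus", "mean_intensity": 0.66},
    {"path": "img_103.tiff", "structure": "nucleus", "mean_intensity": 0.63},
]
drift = check_profile(profile(new_batch, "mean_intensity"), baseline)
print(drift)
```

An empty list from `check_profile` means the new data matches the baseline profile within tolerance; a non-empty list flags the statistics to investigate before retraining or re-running a model on the new package version.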

Jackson Brown

Allen Institute for Cell Science

Jackson Brown is a research engineer working on data release infrastructure for the modeling team at the Allen Institute for Cell Science. He is also the cofounder of the Council Data Project, an organization working to enable better public transparency and discourse. Previously, he was a designer for SageMathCloud (CoCalc), a collaborative computation service.

Aneesh Karve

Quilt

Aneesh Karve is the CTO of Quilt Data, a Y Combinator company advancing an open source standard for versioned data. Previously, Aneesh was a product manager, lead designer, and software engineer at companies including Microsoft, NVIDIA, and Matterport and the general manager and founding member of AdJitsu, the first real-time 3D advertising platform for iOS (acquired by Amobee in 2012). He holds degrees in chemistry, mathematics, and computer science. Aneesh’s research background spans proteomics, machine learning, and algebraic number theory.

Comments

Aneesh Karve | COFOUNDER AND CTO

08/25/2018 8:36pm EDT

Here are the slides. I had also posted them to the portal, and I’m not sure why they didn’t show up.

https://github.com/quiltdata/jupytercon

Philipp Kats | DATA SCIENTIST

08/25/2018 7:51am EDT

Great presentation! Is there any chance of getting the slide deck?
Thanks in advance!

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com