Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Supporting reproducibility in Jupyter through dataflow notebooks

David Koop (University of Massachusetts Dartmouth)
1:50pm–2:30pm Friday, August 24, 2018
Average rating: ****.
(4.50, 2 ratings)

Who is this presentation for?

  • Scientists, data analysts, developers, and notebook users

Prerequisite knowledge

  • A working knowledge of the Jupyter Notebook

What you'll learn

  • Learn a new way of writing and organizing notebooks and structuring cells around outputs
  • Understand the benefits of the dataflow structure

Description

Jupyter notebooks are an important tool for exploratory data analysis, but their ability to combine code, outputs, and text has also enabled their use to document and reproduce results. Notebooks can further play an important role in enabling reuse and extensions of those results; users can not only examine and reexecute past results but also tweak inputs and test new ideas. However, the ability to modify and execute any cell, at any place in the notebook, can sometimes conflict with the desire to present steps in an easy-to-follow, top-down ordering.

David Koop offers an overview of the Dataflow kernel, which combines the ability to tinker with the ability to define precise dependencies between cells. David shows how it can be used to robustly link cells as a notebook is developed and demonstrates how that notebook can be reused and extended without impacting its reproducibility. In addition, he explains the importance of intermediate outputs and how to structure cells to better define the boundaries between them and explores other extensions enabled by the dataflow structure and relationships to the JupyterLab environment.

A dataflow notebook promotes cells to be building blocks that can be linked together based on references to each other. More technically, a dataflow encodes dependencies between computations by defining how the outputs of one computation are inputs of another computation. For example, cell A might calculate a value, cell B might read an image, and cell C might use cell A’s output to blur cell B’s output to create a new image. If we change cell B to read a different image, it seems appropriate to also update cell C’s output. With robust dependencies, we can automatically execute such updates. This structure also allows users to examine how a change to one cell might impact results in other cells of the notebook. An interactive visualization of the dependency graph can help users navigate complex notebooks.

In a dataflow notebook, outputs (as defined by the last line of a cell) should reflect what was computed in a cell. Jupyter provides extensive capabilities for displaying and customizing outputs, but a cell with multiple outputs is often encapsulated as a tuple or other collection that limits these rich displays. To address this limitation, David shares extensions to the Jupyter notebook to render each output individually. For example, a tuple (df, im) would generate two outputs, the DataFrame df as an HTML table and the image im as an image. Combined with an identification scheme to reference the individual outputs, users can more clearly examine and link specific outputs to be used in other cells.

Photo of David Koop

David Koop

University of Massachusetts Dartmouth

David Koop is an assistant professor in the Computer and Information Science Department at UMass Dartmouth. His research interests include data visualization, computational provenance, and data science environments. He has served as a core developer for the VisTrails project and has collaborated with scientists in the fields of climate science, quantum physics, and invasive species modeling. David holds a PhD in computing from the University of Utah.