Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.
The official Jupyter Conference
August 22-23, 2017: Training
August 23-25, 2017: Tutorials & Conference
New York, NY

Postpublication peer review of Jupyter notebooks referenced in articles on PubMed Central

Daniel Mietchen (University of Virginia)
4:10pm–4:50pm Friday, August 25, 2017
Reproducible research and open science
Location: Murray Hill Level: Intermediate

Who is this presentation for?

  • Researchers, data librarians, reviewers, publishers, and research administrators

Prerequisite knowledge

  • Basic familiarity with the Jupyter Notebook and the concept and practicalities of reproducibility in research

What you'll learn

  • Understand why efforts are needed to improve and standardize the way Jupyter notebooks and associated containers are shared alongside published research articles, and what mechanisms (e.g., badges) could signal whether a given notebook has passed such a standardized procedure

Description

Jupyter notebooks are a popular option for sharing data science workflows. Daniel Mietchen shares best practices for reproducibility and other aspects of usability (documentation, ease of reuse, etc.) gleaned from an ongoing analysis of Jupyter notebooks referenced in PubMed Central, a project that started at a hackathon earlier this year and is being documented on GitHub.

The aim of the project is to understand and document the extent to which these publicly accessible notebooks are reproducible, both individually and collectively. By identifying the existing barriers to reproducibility, the team hopes to lower those barriers for notebooks shared in the future.

To find research articles with associated Jupyter notebooks, the team performed a search for “ipynb OR Jupyter” on PubMed Central, a full-text database of biomedical articles. This yielded approximately 100 articles, which were then screened for mentions of actual notebooks.
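
For illustration, a minimal sketch of this search step using Biopython's Entrez wrapper for the NCBI E-utilities (the team's actual tooling lives in the project's GitHub repository; the contact address and result limit below are placeholders):

```python
# Hypothetical sketch: full-text query of PubMed Central for notebook
# mentions via the NCBI E-utilities, using Biopython's Entrez module.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact address

# Search the PMC database with the query described in the text.
handle = Entrez.esearch(db="pmc", term="ipynb OR Jupyter", retmax=200)
record = Entrez.read(handle)
handle.close()

print(record["Count"], "candidate articles")
for pmcid in record["IdList"]:
    print("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}/".format(pmcid))
```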

Articles were associated with notebooks in multiple ways, including screenshots, supplementary files, links to nbviewer, GitHub repositories, and individual notebooks on GitHub. Those notebooks that were available in executable form were run in a clean Jupyter environment.

When executing notebooks, the team recorded whether they ran through without errors and, if not, the first error message. They then looked at individual errors and tried to resolve them, most frequently by fixing code and data dependencies. Lack of documentation was often a barrier to reproducibility, as were platform-dependent code (e.g., shell commands) and the use of non-Python software packages (e.g., Java). The next step is to analyze the remaining errors to deduce how they could be avoided and, for those notebooks the team manages to execute, to document whether the results match those originally reported in the associated papers.
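
The execution step can be sketched with nbformat and nbconvert's ExecutePreprocessor; this is an illustrative approximation rather than the team's actual harness, and "analysis.ipynb" is a hypothetical file name:

```python
# Minimal sketch: run a notebook in a fresh kernel and record whether it
# completed, capturing the first error message if it did not.
import nbformat
from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor

def run_notebook(path, timeout=600):
    """Execute a notebook top to bottom; return (ran_clean, first_error)."""
    nb = nbformat.read(path, as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
    try:
        ep.preprocess(nb, {"metadata": {"path": "."}})
        return True, None
    except CellExecutionError as err:
        # Keep only the first failure, as in the protocol described above.
        return False, (str(err).strip().splitlines() or ["unknown error"])[-1]

ran_clean, first_error = run_notebook("analysis.ipynb")
print("ran clean" if ran_clean else "failed: " + first_error)
```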

Some notebooks come with a containerized version, usually built with Docker. In such cases, the team takes an approach similar to its analysis of the notebooks themselves: attempting to build the container from scratch, run it, and document any problems encountered along the way, as well as attempts to solve them.
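
A hedged sketch of such a container check, driving the Docker CLI from Python; the repository path and image tag are hypothetical, and a Dockerfile at the repository root is assumed:

```python
# Build the image from scratch (--no-cache) and run it, recording the
# first failing step, mirroring the notebook-level procedure above.
import subprocess

def check_container(repo_dir, tag="pmc-notebook-check"):
    """Return (ok, failed_step, stderr) for a clean build-and-run attempt."""
    steps = [
        ["docker", "build", "--no-cache", "-t", tag, repo_dir],
        ["docker", "run", "--rm", tag],
    ]
    for step in steps:
        result = subprocess.run(step, capture_output=True, text=True)
        if result.returncode != 0:
            return False, " ".join(step), result.stderr.strip()
    return True, None, None

ok, failed_step, stderr = check_container("article-repo")
print("container ran clean" if ok else "failed at: " + failed_step)
```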

The project is an ongoing collaborative effort and is being documented in an open science manner on GitHub, as is this submission. Thanks to everyone who has contributed so far.

Daniel Mietchen

University of Virginia

Daniel Mietchen is a biophysicist interested in integrating research workflows with the World Wide Web, particularly through open licensing, open standards, public version histories, and forkability. With research activities spanning from the subcellular to the organismic level, from fossils to developing embryos and from insect larvae to elephants, he has experienced multiple shades of the research cycle and a variety of approaches to collaboration and sharing in research contexts. He has also been contributing to Wikipedia and its sister projects for more than a decade and is actively engaged in increasing the interactions between the Wikimedia and research communities.