Jupyter notebooks are a popular option for sharing data science workflows. Daniel Mietchen shares best practices for reproducibility and other aspects of usability (documentation, ease of reuse, etc.) gleaned from analyzing Jupyter notebooks referenced in PubMed Central, an ongoing project that started at a hackathon earlier this year and is being documented on GitHub.
The aim of this project was to understand and document the extent to which these publicly accessible notebooks are reproducible, both individually and collectively. By identifying the existing barriers to reproducibility, the hope is to lower those barriers for notebooks shared in the future.
To find research articles with associated Jupyter notebooks, the team performed a search for “ipynb OR Jupyter” on PubMed Central, a full-text database of biomedical articles. This yielded approximately 100 articles, which were then screened for mentions of actual notebooks.
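A search like this can also be reproduced programmatically against NCBI's E-utilities API (an assumption on my part; the team may have used the PMC web interface directly). A minimal sketch, using only the standard library:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pmc_search_url(term, retmax=200):
    """Build an E-utilities esearch URL against the PMC full-text database."""
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS + "?" + urlencode(params)

# The query used in the project:
url = build_pmc_search_url("ipynb OR jupyter")

# Fetching the matching article IDs requires network access:
# with urlopen(url) as resp:
#     ids = json.load(resp)["esearchresult"]["idlist"]
```

The returned IDs would still need the manual screening step described above, since a full-text hit on "ipynb" does not guarantee an actual notebook is attached.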
The association of articles with notebooks took multiple forms, such as screenshots, supplementary files, links to nbviewer, GitHub repositories, and individual notebooks on GitHub. For those notebooks that were available in an executable form, the team executed them in a clean Jupyter environment.
When executing notebooks, the team recorded whether each ran through without errors and, if not, the first error message. They then examined individual errors and tried to resolve them; most often, this meant fixing code and data dependencies. Lack of documentation was a frequent barrier to reproducibility, as were platform-dependent code (e.g., shell commands) and the use of non-Python software (e.g., Java). The next step is to analyze the remaining errors to deduce how they could be avoided and, for the notebooks the team manages to execute, to document whether the results match those originally reported in the associated papers.
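The run-and-record loop described above can be sketched as a small driver that executes each notebook with `jupyter nbconvert --execute` and pulls the first exception line out of the traceback. This is a hypothetical reconstruction, not the team's actual script, and it assumes nbconvert's default behavior of stopping at the first failing cell:

```python
import re
import subprocess

def first_error(stderr_text):
    """Return the first exception line (e.g. 'NameError: ...') from a traceback, or None."""
    for line in stderr_text.splitlines():
        if re.match(r"\w+(Error|Exception)\b", line.strip()):
            return line.strip()
    return None

def run_notebook(path, timeout=600):
    """Execute one notebook in a subprocess; return (succeeded, first_error_or_None)."""
    proc = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", "executed.ipynb", path],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode == 0:
        return True, None
    return False, first_error(proc.stderr)
```

Running this in a clean environment (a fresh virtualenv or container) is what surfaces the missing code and data dependencies: a notebook that works on the author's machine typically fails here with a `ModuleNotFoundError` or `FileNotFoundError` as its first error.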
Some notebook examples include a containerized version, usually via Docker. In such cases, the team takes an approach similar to analyzing the notebooks themselves: attempting to build the container from scratch, running it, and documenting the problems encountered along the way, as well as attempts to solve them.
Daniel Mietchen is a biophysicist interested in integrating research workflows with the World Wide Web, particularly through open licensing, open standards, public version histories, and forkability. With research activities spanning from the subcellular to the organismic level, from fossils to developing embryos and from insect larvae to elephants, he has experienced multiple shades of the research cycle and a variety of approaches to collaboration and sharing in research contexts. He has also been contributing to Wikipedia and its sister projects for more than a decade and is actively engaged in increasing the interactions between the Wikimedia and research communities.
©2017, O'Reilly Media, Inc.