Exploratory data analysis in computationally intensive disciplines often requires combining tools implemented in different programming languages and analyzing large datasets on high-performance computing systems (e.g., computer clusters). Despite the large number of kernels that Jupyter supports and the availability of magics for executing scripts in other languages, it remains challenging to use Jupyter to develop multilanguage data analysis workflows and streamline the analysis of large amounts of data on remote systems.
Bo Peng offers an overview of Script of Scripts (SoS), a Python 3-based workflow engine with a Jupyter frontend. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-oriented, outcome-oriented (makefile style), and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated filesystems. As a polyglot notebook, SoS allows the use of multiple kernels in a single Jupyter notebook. In addition to magics such as %expand and %capture to compose scripts and capture outputs from all Jupyter kernels, SoS allows exchange of variables among kernels of supported languages. Other useful features of the SoS kernel include a side panel that allows scratch execution of statements, preview of files and expressions, and line-by-line execution of statements in cells. This combination enables users to analyze data using multiple scripting languages in one notebook and, if needed, convert scripts to workflows to analyze large amounts of data on remote systems.
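To make the difference between the two workflow styles concrete, here is a minimal sketch in plain Python. It is not SoS syntax and does not use any SoS API; the `build` function, the `rules` table, and the step names are all hypothetical and exist only for illustration. A process-oriented workflow runs steps in a fixed order, while an outcome-oriented (makefile style) workflow names the desired result and resolves whatever it depends on.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rule table (not SoS syntax): each target maps to its
# dependencies and to an action that produces the target from built results.
Rule = Tuple[List[str], Callable[[Dict[str, str]], str]]


def build(target: str, rules: Dict[str, Rule], done: Dict[str, str]) -> str:
    """Produce `target`, first producing anything it depends on."""
    if target in done:  # already built, so reuse the stored result
        return done[target]
    deps, action = rules[target]
    for dep in deps:
        build(dep, rules, done)
    done[target] = action(done)
    return done[target]


rules: Dict[str, Rule] = {
    "raw": ([], lambda done: "3,1,2"),
    "sorted": (["raw"], lambda done: ",".join(sorted(done["raw"].split(",")))),
    "report": (["sorted"], lambda done: "values: " + done["sorted"]),
}

# Process-oriented style: run every step in an explicit, fixed order.
forward: Dict[str, str] = {}
for step in ["raw", "sorted", "report"]:
    build(step, rules, forward)

# Outcome-oriented style: ask only for the result and let the
# dependencies be resolved on demand.
result = build("report", rules, {})
```

Both styles end with the same report here; the outcome-oriented call simply never needs the step order spelled out.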
Researchers benefit from both the SoS workflow system and the Jupyter kernel: they have the flexibility to use their preferred tools for each task without having to worry about data flow, and they can perform light interactive analysis while simultaneously executing heavy remote tasks in the same notebook in a neat and organized fashion. SoS is distributed freely under a BSD license. A live Jupyter server and several Docker containers are provided for testing and running SoS easily. The SoS frontend is being ported to JupyterLab, with the goal of releasing it alongside JupyterLab 1.0.
Bo Peng is an assistant professor in the Department of Bioinformatics and Computational Biology at the University of Texas’s MD Anderson Cancer Center. Drawing on his background in mathematics, bioinformatics, and computer science, Bo applies advanced computational techniques (parallel computation, large-scale simulations) to research topics in population genetics, genetic epidemiology, and bioinformatics. He is the author of simuPOP, a leading population genetics simulator, as well as Variant Tools, a set of software tools for the integrated annotation, manipulation, and analysis of genetic variants from whole exome and whole genome sequencing studies. Script of Scripts is his most recent project.
©2018, O'Reilly Media, Inc.