Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

SoS: A polyglot notebook and workflow system for both interactive multilanguage data analysis and batch data processing

Bo Peng (The University of Texas, MD Anderson Cancer Center)
5:00pm–5:40pm Thursday, August 23, 2018

Who is this presentation for?

  • Scientists who use multiple scripting languages for data analysis

Prerequisite knowledge

  • Familiarity with the Jupyter platform (how to install and use kernels, how to use the Jupyter Notebook, etc.)

What you'll learn

  • Learn how to use SoS to analyze data using multiple scripting languages in one Jupyter notebook and how to convert scripts developed during interactive data analysis to workflows for batch data processing on remote high-performance computing systems

Description

Exploratory data analysis in computationally intensive disciplines often necessitates exploiting a variety of tools implemented in different programming languages and analyzing large datasets on high-performance computing systems (e.g., computer clusters). Despite the large number of kernels that Jupyter supports and the availability of magics for executing scripts in other languages, it remains challenging to use Jupyter to develop multilanguage data analysis workflows and streamline the analysis of large amounts of data on remote systems.

Bo Peng offers an overview of Script of Scripts (SoS), a Python 3-based workflow engine with a Jupyter frontend that allows the use of multiple kernels in one notebook. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-based, outcome-oriented (makefile style), and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated filesystems. As a polyglot notebook, SoS allows the use of multiple kernels in a single Jupyter notebook. In addition to magics such as %expand and %capture to compose scripts and capture outputs from all Jupyter kernels, SoS allows the exchange of variables among kernels of supported languages. Other useful features of the SoS kernel include a side panel that allows scratch execution of statements, preview of files and expressions, and line-by-line execution of statements in cells. This unique combination enables users to analyze data using multiple scripting languages in one notebook and, if needed, convert scripts to workflows to analyze large amounts of data on remote systems.
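The difference between the process-based and outcome-oriented (makefile style) workflow styles can be sketched in a few lines of plain Python. This is only an illustration of the two ideas, not SoS syntax and not how SoS is implemented; the names run_forward, build, steps, and rules are hypothetical and chosen for this example.

```python
# Illustrative only: plain Python, not SoS syntax or SoS internals.

def run_forward(steps, data):
    """Process-based style: run numbered steps in ascending order."""
    for _, func in sorted(steps.items()):
        data = func(data)
    return data


def build(target, rules, done=None):
    """Outcome-oriented (makefile) style: build a target's dependencies first."""
    done = {} if done is None else done
    if target not in done:
        deps, action = rules[target]
        inputs = [build(dep, rules, done) for dep in deps]
        done[target] = action(*inputs)
    return done[target]


# Process-based: steps 10 and 20 run in numeric order, whatever order they are declared in.
steps = {
    20: lambda values: sum(values),
    10: lambda values: [v * 2 for v in values],
}
forward_result = run_forward(steps, [1, 2, 3])

# Outcome-oriented: request "total"; "doubled" and "raw" are built on demand.
rules = {
    "raw": ((), lambda: [1, 2, 3]),
    "doubled": (("raw",), lambda raw: [v * 2 for v in raw]),
    "total": (("doubled",), lambda doubled: sum(doubled)),
}
target_result = build("total", rules)

print(forward_result, target_result)
```

Both calls compute the same value. The first is driven by the order of the steps, the second by the requested outcome and its declared dependencies.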

Researchers benefit from both the SoS workflow system and the Jupyter kernel: they have the flexibility to use their preferred tools for each task without having to worry about data flow, and they can perform light interactive analysis while heavy remote tasks execute simultaneously, all in the same notebook in a neat and organized fashion. SoS is distributed freely under a BSD license. A live Jupyter server and several Docker containers are provided for testing and running SoS easily. The SoS frontend is being ported to JupyterLab, with the goal of releasing it alongside JupyterLab 1.0.

Bo Peng

The University of Texas, MD Anderson Cancer Center

Bo Peng is an assistant professor in the Department of Bioinformatics and Computational Biology at the University of Texas’s MD Anderson Cancer Center. Drawing on his background in mathematics, bioinformatics, and computer science, Bo applies advanced computational techniques (parallel computation, large-scale simulations) to research topics in population genetics, genetic epidemiology, and bioinformatics. He is the author of simuPOP, a leading population genetics simulator, and of Variant Tools, a set of software tools for the integrated annotation, manipulation, and analysis of genetic variants from whole exome and whole genome sequencing studies. Script of Scripts is his most recent project.

Comments

Bo Peng | ASSISTANT PROFESSOR
08/22/2018 7:35am EDT

The GitHub repo for the talk is https://github.com/vatlab/JupyterCon2018, and you can already find and execute all the examples on our live server at http://vatlab.github.io/sos/live