Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Analysis using Jupyter Notebooks on the National Cancer Institute Cloud Resources

Moderated by: Hsinyi Tsang

The National Cancer Institute Cloud Resources, formerly the NCI Cancer Genomics Cloud Pilots, were developed to democratize access to NCI-generated cancer genomic data and to facilitate analysis by co-locating petabyte-scale data with cloud computing resources. Built on commercial cloud architectures, the Cloud Resources give users flexibility and reproducibility by packaging tools as Docker containers, and tools can be chained into workflows described in Common Workflow Language (CWL) or Workflow Description Language (WDL). In addition, two of the Cloud Resources support interactive analysis with Jupyter notebooks as an integrated platform feature.

The Broad Institute’s FireCloud, built on the Google Cloud Platform (GCP), has integrated Jupyter notebooks into its workspaces. In these shareable computational sandboxes, researchers organize and store their genetic datasets and run analysis workflows. With the addition of notebooks, researchers can perform tertiary analysis on data stored in workspaces or in any FireCloud-managed GCP resource without additional authentication. A Python FireCloud client (FISS) can be used to access workspaces and other FireCloud objects.

On the Seven Bridges Cancer Genomics Cloud (CGC), researchers can use the JupyterLab environment for custom scripting in R, Python, and Julia through an interactive analysis feature called Data Cruncher. Data Cruncher is accessible through custom workspaces on the CGC, where researchers can organize files, run complementary analyses on AWS using both Dockerized tools and Data Cruncher, and share data, tools, and notebooks with collaborators. Both Data Cruncher notebooks and Dockerized tools can be used to collaboratively explore and mine data that are publicly available through the CGC, including multi-omic datasets from the TCGA (The Cancer Genome Atlas) and TARGET (Therapeutically Applicable Research To Generate Effective Treatments) initiatives, as well as private data uploaded or generated by researchers.

Through the Cloud Resources and their Jupyter Notebook and JupyterLab features, users can integrate interactive, exploratory analysis with other types of cancer analysis pipelines.