Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Jupyter/IPython: Scaling Data Science to Organizations and the Open Web

Brian Granger (Cal Poly San Luis Obispo), Fernando Perez (UC Berkeley and Lawrence Berkeley National Laboratory), Kyle Kelley (Netflix)
4:50pm–5:30pm Thursday, 02/19/2015
Data Science
Location: LL20 A

Data science, in virtually all contexts (for-profit companies, university classrooms, or the open web), is a collaborative activity. Data scientists work together to munge data, build models, perform statistical analysis and create visualizations. The results of their work need to be shared with different audiences, who in turn need to interact with those results in meaningful ways. In this talk we describe recent efforts in the IPython/Jupyter project to address the pain points of collaborative data science in organizations and on the open web.

The Jupyter/IPython project is a set of open-source software projects for interactive and exploratory computing. These projects make data science reproducible and multi-language (Python, Julia, R, etc.). The main application developed by the project is the Jupyter/IPython Notebook, a web-based interactive computing platform that allows users to author data- and code-driven narratives that combine live code, equations, narrative text, visualizations, interactive dashboards and other media. These documents provide a complete record of a computation that can be shared with others.

Until now, the Notebook has been a single-user web app that individual users deploy on their own laptops and desktops. This model limits collaboration and makes it difficult to run the Notebook for larger groups of users. To address these limitations, we have built a multiuser version of the Notebook that is designed to be deployed centrally within organizations and on clouds.

This multiuser notebook consists of a set of flexible building blocks that can be assembled in different ways. These building blocks include (1) a dynamic proxy for routing web traffic to individual notebook servers, (2) extensible user management and authentication that works with Linux user accounts, OAuth and other systems, and (3) extensible methods of spawning individual notebook servers using subprocesses, SSH or Docker.
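As a rough illustration of how the three building blocks above could fit together, here is a minimal Python sketch. It is not the project's actual API: the names RoutingTable, Hub, make_dict_authenticator and make_port_spawner are hypothetical, the in-memory password table stands in for Linux accounts or OAuth, and the spawner only hands out URLs instead of starting a real notebook server.

```python
import itertools
from typing import Callable, Dict, Optional


class RoutingTable:
    """Toy stand-in for the dynamic proxy: maps URL prefixes to backends."""

    def __init__(self, default_target: str) -> None:
        self.default_target = default_target
        self.routes: Dict[str, str] = {}

    def add_route(self, prefix: str, target: str) -> None:
        self.routes[prefix] = target

    def resolve(self, path: str) -> str:
        # Longest matching prefix wins; unmatched paths go to the default.
        matches = [p for p in self.routes
                   if path == p or path.startswith(p + "/")]
        if not matches:
            return self.default_target
        return self.routes[max(matches, key=len)]


# Pluggable pieces: an authenticator returns a username or None,
# a spawner returns the URL of that user's notebook server.
Authenticator = Callable[[str, str], Optional[str]]
Spawner = Callable[[str], str]


def make_dict_authenticator(passwords: Dict[str, str]) -> Authenticator:
    """Hypothetical authenticator backed by an in-memory table."""
    def authenticate(username: str, password: str) -> Optional[str]:
        return username if passwords.get(username) == password else None
    return authenticate


def make_port_spawner(first_port: int) -> Spawner:
    """Hypothetical spawner: assigns each user the next local port."""
    ports = itertools.count(first_port)

    def spawn(username: str) -> str:
        return f"http://127.0.0.1:{next(ports)}"
    return spawn


class Hub:
    """Ties the blocks together: authenticate, spawn, then add a route."""

    def __init__(self, proxy: RoutingTable, authenticate: Authenticator,
                 spawn: Spawner) -> None:
        self.proxy = proxy
        self.authenticate = authenticate
        self.spawn = spawn

    def login(self, username: str, password: str) -> str:
        user = self.authenticate(username, password)
        if user is None:
            raise PermissionError("authentication failed")
        prefix = f"/user/{user}"
        if prefix not in self.proxy.routes:
            # Spawn a server only on the user's first login.
            self.proxy.add_route(prefix, self.spawn(user))
        return prefix
```

Because the authenticator and spawner are plain callables, swapping Linux accounts for OAuth, or subprocesses for Docker, would not require changes to the hub or the proxy in this sketch.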

We will demonstrate a number of different ways of deploying these building blocks for different use cases.

First, we will talk about deploying the multiuser server internally within organizations to groups of trusted users, such as a data science group within a company. In this context, each user will have access to the Notebooks and files in their Linux home directory. The server will support collaboration features based on the Linux permissions model.

Second, we will talk about deploying on the open web. In this context, user accounts can either be anonymous or handled by OAuth, and each user is sandboxed inside a Docker container.
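To make the per-user sandbox idea concrete, the sketch below builds (but does not run) a docker run command for one user's notebook container. It is an assumption-laden illustration, not the deployment shown in the talk: the function name docker_run_command, the image name example/notebook:latest, the resource limits, and the choice of container port 8888 are all hypothetical.

```python
import re
from typing import List


def docker_run_command(username: str,
                       image: str = "example/notebook:latest",
                       host_port: int = 8888,
                       memory: str = "512m",
                       cpus: str = "0.5") -> List[str]:
    """Build the argv for starting one user's sandboxed notebook container.

    The image name and limits are placeholders chosen for illustration.
    """
    # Reject names that are not safe to embed in a container name.
    if not re.fullmatch(r"[a-z0-9][a-z0-9_.-]*", username):
        raise ValueError(f"unsafe username: {username!r}")
    return [
        "docker", "run",
        "--detach",                    # run in the background
        "--rm",                        # delete the container when it exits
        "--name", f"notebook-{username}",
        "--memory", memory,            # cap memory per user
        "--cpus", cpus,                # cap CPU per user
        # Publish only on localhost so traffic must pass through the proxy.
        "--publish", f"127.0.0.1:{host_port}:8888",
        image,
    ]
```

Passing the resulting list to subprocess.run would start the container; returning the list instead keeps the sketch easy to inspect and test without Docker installed.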

For both of these deployment scenarios we will discuss the performance, security and scalability issues. We will conclude by describing future directions for collaborative data science with the Notebook.

Brian Granger

Cal Poly San Luis Obispo

Brian Granger is an Associate Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical physics, with a Ph.D. from the University of Colorado. His current research interests include quantum computing, parallel and distributed computing, and interactive computing environments for scientific computing and data science. He is a leader of the IPython project, a co-founder of Project Jupyter, and an active contributor to a number of other open source projects focused on data science in Python. He is a board member of the NumFOCUS Foundation and a fellow at Cal Poly’s Center for Innovation and Entrepreneurship. He is @ellisonbg on Twitter and GitHub.

Fernando Perez

UC Berkeley and Lawrence Berkeley National Laboratory

Fernando Perez is a research scientist at the UC Berkeley Helen Wills Neuroscience Institute, and a founding investigator of the Berkeley Institute for Data Science, created in 2013. He received a PhD in particle physics, followed by postdoctoral research in applied mathematics, developing numerical algorithms. Today, his research focuses on creating tools for modern computational research and data science across domain disciplines, with an emphasis on high-level languages, literate computing and reproducible research.

He created IPython while a graduate student in 2001 and continues to lead it as it evolves into the Jupyter Project, now as a collaborative effort with a talented team that does all the hard work. He regularly lectures about scientific computing and data science, and is a member of the Python Software Foundation as well as a founding director of the NumFOCUS Foundation. He is the recipient of the 2012 Award for the Advancement of Free Software from the Free Software Foundation.

Kyle Kelley

Netflix

Kyle Kelley is a senior software engineer at Netflix, a maintainer on nteract.io, and a core developer of the IPython/Jupyter project. He wants to help build great environments for collaborative analysis, development, and production workloads for everyone, from small teams to massive scale.