Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.

The official Jupyter Conference

August 22-23, 2017: Training

August 23-25, 2017: Tutorials & Conference

New York, NY

Deploying interactive Jupyter dashboards for visualizing hundreds of millions of datapoints, in 30 lines of Python

James Bednar (Anaconda), Philipp Rudiger (Anaconda)

1:30pm–5:00pm Wednesday, August 23, 2017

Usage and application
Location: Concourse E
Level: Intermediate

Who is this presentation for?

Analysts, scientists, engineers, journalists, and data scientists

Prerequisite knowledge

A working knowledge of Python and the Jupyter Notebook
Familiarity with NumPy and pandas

Materials or downloads needed in advance

A laptop with a recent version of Jupyter and an Anaconda or Miniconda environment installed
Follow these instructions to download the tutorials, associated datasets, and required libraries. Please complete these steps before you arrive at the conference.

What you'll learn

Learn how to build flexible visualizations using very little code, process and visualize very large datasets using Python, make reproducible notebooks, and deploy notebooks as dashboards

Description

The flexibility of Python and Jupyter notebooks makes it feasible to stitch together the various tools and libraries in the Python scientific software ecosystem to solve specific problems. However, it is often unclear how best to do so in a given case, and a variety of technical problems typically arise in practice. For instance, one common use for such notebooks is to take a dataset of some type, filter or process it, visualize it, and share the results with colleagues. Many packages might be relevant to this goal, and there are even more possible combinations of those packages, each of which can present practical problems that are daunting to overcome.

The amount of code involved quickly increases as more complex problems are addressed, making the notebooks unreadable and unmaintainable. To make the notebooks maintainable, general-purpose code can be extracted into separate Python modules, but doing so can be very difficult because that code interacts with domain-specific, widget-related, and visualization-related code, all of which tend to be intermingled in Jupyter Notebook visualizations. As soon as code is extracted into separate modules, reproducibility becomes difficult because each version of the notebook depends on specific versions of external libraries, making it hard for others to run your notebooks (and for you to rerun them later). Interactive notebook-based visualizations inherit the memory limitations of web browsers and thus work well for small datasets but struggle as datasets reach millions or billions of data points. Python-based solutions can be prohibitively slow, particularly when working with large datasets, making it tempting for users to switch to less maintainable and extremely verbose solutions using compiled languages. Sharing the final results of an analysis with people who do not work with Python is often difficult and can require developing a separate web application when you need to deploy the results more widely.

James Bednar and Philipp Rudiger present an overall workflow for building interactive dashboards that visualize even billions of data points in a Jupyter notebook, with graphical widgets allowing control over data selection, filtering, and display options, all using only a few dozen lines of code. This workflow is based on the following open source Python packages in a Jupyter Notebook environment:

HoloViews and GeoViews: Declarative specification for visualizable/plottable objects
param and paramnb: Declarative specification for user-modifiable parameters
conda: Flexible dependency tracking for building reproducible environments
datashader: For rendering arbitrarily large datasets faithfully as fixed-size images
fastparquet: For fast reading of large files into memory
dask: For flexibly dispatching computational tasks to cores or processors
Numba: For compiling array-based Python code down to fast machine code
Bokeh: For building visualization-based web applications flexibly from Python
Jupyter Dashboards: For deploying Jupyter notebooks as web server applications

James and Philipp demonstrate how to use conda to coordinate versions of all these packages and Jupyter to stitch them together. fastparquet loads the large datasets quickly, and HoloViews and GeoViews store metadata with the data to support automatic visualization later. param declares the parameters and ranges of interest to the user independently of the notebook mechanisms, and paramnb creates ipywidgets automatically for these parameters. datashader renders the entire dataset into an image to avoid overwhelming the browser (and the user), with dask coordinating datashader’s computation across cores and Numba accelerating it. Bokeh delivers the visualization as an interactive figure, and Jupyter Dashboards deploys the cells as a standalone web application that can be shared with colleagues. All of these steps rely only on freely available, domain-general libraries that each do one thing very well and work well with each other. The resulting workflow can easily be retargeted for novel analyses and visualizations of other datasets serving other purposes, making it practical to develop and deploy reproducible high-performance interactive visualizations in any domain using the Jupyter Notebook.
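
One reason the rendering step can be spread across cores is that per-pixel counts are additive: each partition of the data can be aggregated on its own, and the partial grids summed afterward. The sketch below shows this property using only NumPy and the standard library's thread pool as a stand-in scheduler. It does not use dask or datashader, and the names aggregate_chunk, X_RANGE, Y_RANGE, and SHAPE are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# All chunks must share the same ranges and grid shape, otherwise the
# partial grids would not line up pixel for pixel.
X_RANGE = (-5.0, 5.0)
Y_RANGE = (-5.0, 5.0)
SHAPE = (300, 400)  # (rows, columns)

def aggregate_chunk(chunk):
    """Count the points of one chunk into a fixed grid (hypothetical helper)."""
    xs, ys = chunk
    counts, _, _ = np.histogram2d(ys, xs, bins=SHAPE, range=(Y_RANGE, X_RANGE))
    return counts

# Synthetic data standing in for a large dataset.
rng = np.random.default_rng(1)
x = rng.normal(size=2_000_000)
y = rng.normal(size=2_000_000)

# Split the data into eight partitions and aggregate them concurrently.
chunks = list(zip(np.array_split(x, 8), np.array_split(y, 8)))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(aggregate_chunk, chunks))

# Summing the partial grids gives the same image as a single pass.
total = np.sum(partials, axis=0)
single_pass = aggregate_chunk((x, y))
```

Because the partial results combine by simple addition, the partitions never need to communicate while they are being processed, which is what makes this kind of aggregation a natural fit for a task scheduler.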

James Bednar

Anaconda

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Philipp Rudiger

Anaconda

Philipp Rudiger is a software developer at Anaconda, where he develops open source and client-specific software solutions for data management, visualization, and analysis. Philipp holds a PhD in computational modeling of the visual system.
