Solving data science problems involves some tasks common to many projects and situations, plus custom requirements that differ for each specific application. With Python, these common elements can often be handled by packages already available in the Python software ecosystem. The data scientist then simply writes custom code (typically in a Jupyter notebook) to stitch them together and finish the task. This approach can handle a wide range of problems while requiring relatively little software development skill or effort. However, it is often unclear how to select the right set of packages for a particular problem, and a variety of technical issues typically arise in practice.
As a concrete example, a very common use for a data science notebook is to take a dataset of some type, filter or process it, visualize it, and share the results with colleagues. To achieve this seemingly straightforward goal, there are very many packages that might be relevant and even more possible combinations of those packages, each of which can present various practical problems that are daunting to overcome:
The new PyViz.org initiative is designed to reduce these difficulties in three ways: smoothing over differences and incompatibilities between many of the packages, providing additional functionality where necessary to optimize key steps, and providing a comprehensive set of examples and tutorials that show how to put the packages together into solutions for real problems.
James Bednar guides you through an overall workflow for building interactive notebooks and dashboards that visualize even billions of data points, with graphical widgets allowing custom control over data selection, filtering, and display options, all using only a few dozen lines of code. James also demonstrates how the same approach makes it simple to work with live streaming data, complex custom interactivity, very-high-dimensional datasets, and geographic data. This workflow is based on using the following open source Python packages in a Jupyter Notebook environment, each labeled with the problem(s) it addresses from the above list:
James demonstrates how to use Conda to coordinate versions of all these packages, Jupyter to stitch them together, fastparquet to load the large datasets quickly, HoloViews and GeoViews to attach metadata to the data that supports automatic visualization later, Param to declare parameters and ranges of interest to the user independently of the notebook mechanisms that will later become widgets automatically, Datashader to render the entire dataset into an image to avoid overwhelming the browser (and the user), Dask to coordinate Datashader’s computation across cores, Numba to accelerate this computation, Bokeh to deliver the visualization as an interactive figure, and Bokeh Server to deploy the cells as a standalone web application that can be shared with colleagues. All of these steps rely only on freely available, domain-general libraries that each do one thing very well and are designed to work well with each other. The resulting workflow can easily be retargeted for novel analyses and visualizations of other datasets serving other purposes, making it practical to develop and deploy concise, reproducible high-performance interactive visualizations in any domain using Python.
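The step of rendering the entire dataset into an image can be illustrated with a minimal sketch. It uses only the Python standard library and is not Datashader's API or implementation; the function name `rasterize_counts` and the synthetic Gaussian data are assumptions made for illustration. The idea it shows is that the points are reduced to a fixed-size grid of per-pixel counts, so the amount of data sent to the browser depends on the image size rather than on the number of points.

```python
import random


def rasterize_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall in each cell of a width x height grid.

    Hypothetical helper for illustration only; not part of Datashader.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the viewport
        # Map the coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


# Synthetic data (an assumption): 100,000 points drawn from a 2D Gaussian.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

grid = rasterize_counts(points, (-4, 4), (-4, 4), width=40, height=20)
total = sum(map(sum, grid))
print(f"{len(points)} points reduced to {len(grid) * len(grid[0])} cells; {total} in view")
```

However many points are supplied, the result here is always 800 counts, which can then be colormapped into an image. In the workflow described above, this kind of aggregation is what Datashader performs, with Numba and Dask handling the acceleration and parallelism.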
James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects Datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.