Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

PyData at Strata (Full Day)

T.J. Alumbaugh (Continuum Analytics), James Powell (NumFOCUS), Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics), Jake Vanderplas (eScience Institute, University of Washington), Katrina Riehl (Continuum Analytics)
9:00am–5:00pm Tuesday, 03/29/2016
Average rating: 4.33 (18 ratings)

Prerequisite knowledge


If you are registered for this tutorial, please download and install Anaconda BEFORE you arrive onsite, for the scikit-learn section.

For "Intro to data visualization with Bokeh," attendees should have Python and Bokeh installed on their system. The simplest way to obtain both is to install the Anaconda Python distribution, which comes with Bokeh and all of its dependencies (Full installation instructions here).



Python has become an increasingly important part of the data-engineer and analytic-tool landscapes. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib, SciPy, and scikit-learn, and explores how to scale Python performance, including handling large, distributed datasets. Come see how the leading lights in the Python data community are making Python ever more useful to data analysts and data engineers.


9:00 AM – 10:30 AM
Data wrangling and intro to pandas
T.J. Alumbaugh and James Powell

T.J. Alumbaugh and James Powell offer a brief tour of the data ingest and data exploration capabilities found in the Python language. We’ll explore a few datasets using pandas, the Jupyter notebook, and the matplotlib plotting package and learn some basic methods of how to clean up real data found in the wild. Then, we’ll do a few ad hoc analyses to explore the datasets. This is a use case where the PyData stack really shines.
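As a rough idea of the kind of cleanup and ad hoc analysis described above, here is a minimal sketch. It is not taken from the tutorial materials: the inline CSV, its column names, and its values are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical messy data (invented for this sketch): inconsistent
# capitalization, stray whitespace, and a missing reading.
RAW = """city,date,temp_f
San Jose,2016-03-28,68
san jose ,2016-03-29,n/a
Seattle,2016-03-28,54
Seattle,2016-03-29,57
"""

# Ingest: treat "n/a" as missing and parse the date column.
df = pd.read_csv(io.StringIO(RAW), na_values=["n/a"], parse_dates=["date"])

# Clean: normalize city names, then drop rows with no temperature.
df["city"] = df["city"].str.strip().str.title()
df = df.dropna(subset=["temp_f"])

# Explore: a quick ad hoc aggregate per city.
mean_temp = df.groupby("city")["temp_f"].mean()
print(mean_temp)
```

In a Jupyter notebook, calling `mean_temp.plot(kind="bar")` would draw the same aggregate with matplotlib.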

10:30 AM – 11:00 AM

11:00 AM – 11:30 AM
Data wrangling and intro to pandas (continued)

11:30 AM – 12:30 PM
Intro to data visualization with Bokeh
Bryan Van de Ven and Sarah Bird

Bokeh allows you to build interactive visualizations for the Web in Python. It has a range of capabilities from quick “one-line” charts to streaming datasets to integrating with your existing plot libraries such as matplotlib or ggplot. Bryan Van de Ven and Sarah Bird give a quick hands-on introduction to Bokeh’s core features. We’ll do exercises building up a variety of visualizations and finish up discussing topics and questions from participants related to their own datasets and needs.
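For a sense of what a small Bokeh plot looks like in code, here is a minimal sketch. It is not from the session materials; the data and titles are invented, and it uses the `bokeh.plotting` interface as found in recent Bokeh releases, which may differ from the version used in the tutorial.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

# Build a figure and add a line plus point markers.
p = figure(title="Squares (illustrative data)",
           x_axis_label="x", y_axis_label="x squared")
p.line(x, y, line_width=2)
p.scatter(x, y, size=8)

# Render the plot to a standalone HTML string that loads BokehJS from a CDN.
html = file_html(p, CDN, "Bokeh sketch")
```

Writing `html` to a file and opening it in a browser shows the interactive plot with Bokeh's default pan and zoom tools.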

12:30 PM – 1:30 PM

1:30 PM – 2:30 PM
Intro to data visualization with Bokeh (continued)

2:30 PM – 3:00 PM
Intro to machine learning with scikit-learn
Jake Vanderplas and Katrina Riehl

Jake Vanderplas and Katrina Riehl offer an introduction to the core concepts of machine learning and the scikit-learn package. After introducing the scikit-learn API, we’ll use it to explore the basic categories of machine-learning problems and related topics such as feature selection and model validation and practice applying these tools to real-world datasets.
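The combination of feature selection and model validation mentioned above can be sketched as follows. This is an illustrative example, not the tutorial's own code: the choice of dataset, estimator, and `k=2` are assumptions, and the module names are those of recent scikit-learn releases, which may differ from the 2016 version.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# A small built-in classification dataset.
X, y = load_iris(return_X_y=True)

# Chain feature selection (keep the 2 highest-scoring features) with a
# classifier, so selection is refit inside every cross-validation fold.
model = make_pipeline(
    SelectKBest(f_classif, k=2),
    LogisticRegression(max_iter=1000),
)

# Model validation: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Putting the selector inside the pipeline keeps the held-out fold from influencing which features are chosen.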

3:00 PM – 3:30 PM

3:30 PM – 5:00 PM
Intro to machine learning with scikit-learn (continued)

T.J. Alumbaugh

Continuum Analytics

T.J. Alumbaugh is a developer at Continuum Analytics. He likes array-oriented computing, Python, and C++.

James Powell


James Powell is a NYC-based Python programmer and master trainer with experience in quantitative finance and data science. James is very active in the Python community in NYC, where he organizes NYC Python (the world’s largest and most active Python meetup group). He also works with the numeric and scientific computing nonprofit NumFOCUS to help organize the PyData conference series. James is a frequent speaker at Python conferences and has been invited to speak at events such as PyData New York, PyData London, PyGotham, the conference For Python Quants, and PyCon Spain.

Bryan Van de Ven

Continuum Analytics

Bryan Van de Ven is a software engineer at Continuum Analytics. Previously, Bryan worked at the Applied Research Labs, developing software for sonar feature detection and classification systems on US Naval submarine platforms, and at Enthought, where he worked on problems in financial risk modeling and fluid mixing simulation. Bryan has also worked on an assortment of iOS projects as an independent consultant. Bryan is a core contributor to Bokeh and has contributed to the Chaco visualization library. Bryan holds undergraduate degrees in computer science and mathematics from UT Austin and a master’s degree in physics from UCLA.

Sarah Bird

Continuum Analytics

Sarah Bird is a software engineer at Continuum Analytics. She has been a core Bokeh developer since 2015 and has given numerous talks and tutorials on Bokeh. Previously, she worked at Aptivate as a full stack web developer building IT solutions for the international development sector. She has worked in a variety of sectors, from systems engineering for ejection seats to mobile health and data collection in Pakistan. Sarah holds a master’s degree in mechanical engineering from Cambridge University and a master of science in technology and policy from the Massachusetts Institute of Technology.

Jake Vanderplas

eScience Institute, University of Washington

Jake Vanderplas is the director of research in the physical sciences at the University of Washington’s eScience Institute, where his research is primarily in the area of data-driven astronomy and astrophysics. In addition, Jake is a maintainer and/or frequent contributor to many open source Python projects, including scikit-learn, scipy, mpld3, and others. He occasionally blogs about Python, machine learning, data visualization, open science, and related topics at

Katrina Riehl

Continuum Analytics

Katrina Riehl is a senior data scientist at Continuum Analytics, where she leads the Memex team. Over the last decade, Katrina has worked extensively in the fields of scientific computing, machine learning, data mining, and visualization. Most notably, she worked at Enthought, the signal and information sciences laboratory at the Applied Research Laboratories of the University of Texas at Austin, and Apple before joining Continuum Analytics. Katrina received her MS and PhD in computer science from the University of Texas at Dallas.

Comments on this page are now closed.


Phillip Burger
03/29/2016 2:58am PDT

The pandas-datareader package is not included in the Anaconda distribution, so an error occurs in line 6 of strata_pandas.ipynb. If you want to use the features in this module, you need to install the package. Here is the conda command to download and install it:

conda install pandas-datareader

If you execute the command ‘conda list’ before and after the install, you’ll see that the package is not there at first, then is present after the install.
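The same before-and-after check can be done from inside Python or a notebook cell. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration, and it assumes the package is imported under the name `pandas_datareader`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import name for the pandas-datareader package.
print(is_installed("pandas_datareader"))
```

This should report False before the install and True afterwards, once the notebook kernel is running in the environment where the package was installed.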