Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

PyData at Strata

Andreas Mueller (NYU, scikit-learn), Jennifer Klay (Cal Poly San Luis Obispo), Peter Wang (Anaconda), Travis Oliphant (Anaconda), Andy Terrel (NumFOCUS), Matthew Rocklin (Anaconda), Wes McKinney (Two Sigma Investments), Stefan van der Walt (UC Berkeley), Jonathan Frederic (IPython), Kyle Kelley (Netflix)
9:00am–5:00pm Wednesday, 02/18/2015
Data Science
Location: LL21 B
Average rating: ****.
(4.62, 8 ratings)



Python has become an increasingly important part of the data engineering and analytics tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including the IPython Notebook, NumPy/matplotlib for visualization, SciPy, and scikit-learn, as well as how to scale Python performance and handle large, distributed data sets. Come see how the leading lights in the Python data community are making Python ever more useful to data analysts and data engineers.


Please note – tutorial prerequisites / course materials listed below

9:00am – 10:30am

Track 1 (room LL21 B):

  • Machine Learning with scikit-learn
    Andreas Mueller

scikit-learn has emerged as one of the most popular open source machine learning toolkits, now widely used in academia and industry. scikit-learn provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models. The tutorial will cover basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection. We will see how to prepare data for machine learning, and go from applying a single algorithm to building a machine learning pipeline.
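
As a rough illustration of the "single algorithm to pipeline" idea, the sketch below chains a scaler and a classifier and scores them with cross-validation. It is not taken from the tutorial materials: the iris dataset, the choice of `StandardScaler` and `LogisticRegression`, and the five folds are all assumptions made for the example, and it uses the current scikit-learn module layout under Python 3.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset (an assumption for this sketch): 150 samples, 3 classes.
X, y = load_iris(return_X_y=True)

# Chain preprocessing and a classifier so both are refit inside every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(pipeline, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"mean accuracy over {len(scores)} folds: {mean_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on each fold's training portion, which avoids leaking information from the held-out data into preprocessing.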

Material (probably notebooks and slides) will be made available here:

Track 2 (LL21 A):

  • Slicing Through Data with NumPy
    Jennifer Klay

In this tutorial, attendees will learn the basics of the NumPy library and how it can be used to enable fast analysis of a wide spectrum of numerical datasets.  Basics of array creation, slicing, broadcasting, and masking will be introduced and several hands-on examples of their use in data analysis will be shown.  The tutorial will be presented using Python 2.7 in the IPython notebook.  A basic installation of Anaconda should be sufficient.  
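
For readers who want a preview, the four topics named above can be sketched in a few lines. This is only an illustrative example with made-up data, not the tutorial notebook, and all variable names are hypothetical.

```python
import numpy as np

# Array creation: a 3x4 grid holding the integers 0..11.
data = np.arange(12).reshape(3, 4)

# Slicing: the first two rows, every second column.
block = data[:2, ::2]

# Broadcasting: subtract each column's mean from every row.
col_means = data.mean(axis=0)   # shape (4,)
centered = data - col_means     # (3, 4) minus (4,) broadcasts across rows

# Masking: keep only the entries greater than 6.
mask = data > 6
large_values = data[mask]

print(block)
print(centered)
print(large_values)
```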

An IPython notebook will be provided on GitHub, but I don’t have a repo set up yet.  The foundation from which I will create the tutorial will be materials from my Computing4Physics course:

11:00am – 12:30pm

Track 1 (room LL21 B):

  • Interactive Web Graphics with Bokeh
    Peter Wang

Bokeh is an open-source library for building web graphics, ranging from simple interactive plots to complex dashboards with streaming data sources.  This tutorial will quickly introduce some of the basic concepts behind Bokeh and then dive into a step-by-step series of exercises which showcase how to embed interactive graphics in an IPython notebook and build more complex linked graphics.  Streaming and large datasets will also be demonstrated.

Track 2 (LL21 A):

  • IPython
    Jonathan Frederic and Kyle Kelley

1:30pm – 3:00pm

Track 1 (room LL21 B):

  • Intro to Numba and Performance Python
    Travis Oliphant

Numba is a just-in-time compiler for Python that can translate a wide range
of Python functions into high performance machine code at runtime. This
tutorial will give an overview of the capabilities of the Numba compiler and
walk through several examples showing how to use Numba to generate fast
implementations of numerical algorithms from pure Python. We will briefly
touch on more advanced features of Numba, such as compiling for the GPU, at
the end.
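
A minimal sketch of the kind of function Numba targets is shown below. It is not from the tutorial notebooks: the function `norm_l2` is hypothetical, and the `ImportError` fallback is an assumption added so the example still runs (uncompiled) where Numba is not installed.

```python
import math

import numpy as np

try:
    from numba import njit      # compiles the decorated function on first call
except ImportError:             # assumed fallback: run as plain Python instead
    def njit(func):
        return func


@njit
def norm_l2(values):
    """Euclidean norm of a 1-D float array, written as an explicit loop."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return math.sqrt(total)


result = norm_l2(np.array([3.0, 4.0]))
print(result)
```

Explicit loops over NumPy arrays like this one are slow in the plain interpreter, which is why they are a typical candidate for just-in-time compilation.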

Prerequisite: a basic installation of Anaconda. Example IPython notebooks will be posted to
GitHub before the tutorial.

Track 2 (LL21 A):

  • Python Data Applications with Blaze and Bokeh
    Andy Terrel and Matthew Rocklin

We use the Blaze and Bokeh libraries to interactively query and visualize large datasets through Python.

Blaze provides a consistent query experience on data ranging from small local CSV files to large remote Impala or Spark clusters.  It automates data migration and brings the power of other database systems into the hands of the armchair analyst.

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.  It provides elegant, concise construction of novel graphics in the style of D3.js, but also delivers this capability with high-performance interactivity over large or streaming datasets.

Tutorial materials at

We also recommend downloading Anaconda

3:30pm – 5:00pm

Track 1 (room LL21 B):

  • Analytics Beyond the Basics with pandas and SQL
    Wes McKinney

In this tutorial, we’ll take a tour through a variety of useful but sometimes tricky analytical tasks and show how they can be tackled with pandas or SQL. Part of the goal is to illustrate how SQL concepts map onto the pandas API and vice versa, and for the participant to learn more about advanced usage of each of the tools.
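
To give a flavor of that mapping, the sketch below runs the same aggregation once with pandas and once with SQL. It is an illustration only, not the tutorial's material: the `orders` table, its columns, the threshold of 20, and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data for the comparison.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat", "bob"],
    "amount": [10.0, 25.0, 5.0, 40.0, 15.0],
})

# pandas: group, aggregate, filter on the aggregate, then sort.
totals = orders.groupby("customer", as_index=False)["amount"].sum()
pandas_result = (
    totals[totals["amount"] > 20]
    .sort_values("customer")
    .reset_index(drop=True)
)

# SQL: the same steps expressed as GROUP BY, HAVING, and ORDER BY.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT customer, SUM(amount) AS amount
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 20
    ORDER BY customer
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Here `groupby(...).sum()` plays the role of `GROUP BY` with `SUM`, the boolean filter on the aggregated frame corresponds to `HAVING`, and `sort_values` corresponds to `ORDER BY`.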

Materials will be posted at

Track 2 (LL21 A):

  • Parsing Pixels with scikit-image
    Stefan van der Walt

Images are information rich, yet while humans interpret them
effortlessly, doing so algorithmically remains, paradoxically, hard.
In this tutorial, I introduce scikit-image and show how to use it for
extracting different types of features, such as region properties,
corners, segments, blobs or lines.  Based on these features, I present
solutions to real-world problems.  Finally, participants are guided in
using scikit-image to solve a proposed challenge.

Andreas Mueller

NYU, scikit-learn

Andreas Mueller received his PhD in machine learning from the University of Bonn. After working as a machine learning researcher on computer vision applications at Amazon for a year, he recently joined the Center for Data Science at New York University. In the last four years, he has been a maintainer and one of the core contributors of scikit-learn, a machine learning toolkit widely used in industry and academia, and an author of and contributor to several other widely used machine learning packages. His mission is to create open tools to lower the barrier to entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Jennifer Klay

Cal Poly San Luis Obispo

Jennifer Klay is an Associate Professor of Physics at Cal Poly San Luis Obispo. She has worked with big data at the CERN Large Hadron Collider’s ALICE experiment for 17 years, unlocking the secrets of the early Universe by colliding heavy nuclei at the highest energies available in the lab. She developed an introductory computational science course using the IPython notebook with the NumPy/SciPy codebase to teach data analysis and numerical methods for students in the physical sciences.

Peter Wang


Peter Wang is the cofounder and CTO of Anaconda, where he leads the product engineering team for the Anaconda platform and open source projects including Bokeh and Blaze. Peter’s been developing commercial scientific computing and visualization software for over 15 years and has software design and development experience across a broad variety of areas, including 3-D graphics, geophysics, financial risk modeling, large data simulation and visualization, and medical imaging. As a creator of the PyData conference, he also devotes time and energy to growing the Python data community by advocating, teaching, and speaking about Python at conferences worldwide. Peter holds a BA in physics from Cornell University.

Travis Oliphant


Travis Oliphant has a Ph.D. from the Mayo Clinic and B.S. and M.S. degrees in Mathematics and Electrical Engineering from Brigham Young University. Since 1997, he has worked extensively with Python for numerical and scientific programming, most notably as the primary developer of the NumPy package, and as a founding contributor of the SciPy package. He is also the author of the definitive Guide to NumPy.

Travis was an assistant professor of Electrical and Computer Engineering at BYU from 2001-2007, where he taught courses in probability theory, electromagnetics, inverse problems, and signal processing. He also served as Director of the Biomedical Imaging Lab, where he researched satellite remote sensing, MRI, ultrasound, elastography, and scanning impedance imaging.

From 2007-2011, Travis was the president at Enthought, Inc. During his tenure there, the company grew from 15 to 50 employees, and Travis worked with well-known Fortune 50 companies in finance, oil-and-gas, and consumer-products. He was involved in all aspects of the contractual relationship, including consulting, training, code-architecture, and development.

As CEO of Continuum Analytics, Travis engages customers in finance, consumer products, and oil and gas, develops business strategy, and helps guide the technical direction of the company. He actively contributes to software development and engages with the wider open source community in the Python ecosystem by serving as a director of the Python Software Foundation and past director of NumFOCUS.

Andy Terrel


Andy Terrel is president of NumFOCUS. He is also the chief data scientist of REX Real Estate, where he brings his experience building smart, scalable data systems to the real estate industry. A data architect, computational scientist, and technical leader, Andy is a passionate advocate for open source scientific codes and has been involved in the wider scientific Python community since 2006, contributing to numerous projects in the scientific stack.

Matthew Rocklin


Matthew Rocklin is an open source software developer at Anaconda focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Wes McKinney

Two Sigma Investments

Wes McKinney is a software architect at Two Sigma Investments. He is the creator of Python’s pandas library and a PMC member for Apache Arrow and Apache Parquet. He wrote the book Python for Data Analysis. Previously, Wes worked for Cloudera and was the founder and CEO of DataPad.

Stefan van der Walt

UC Berkeley

Stéfan van der Walt is a senior lecturer in applied mathematics at
Stellenbosch University, South Africa, and an associate project
scientist in the astronomy department at UC Berkeley. He has been
involved in the development of scientific open source software since
2003, and enjoys teaching Python at workshops and conferences. Stéfan
is the founder of scikit-image and a contributor to NumPy, SciPy, and dipy.

Jonathan Frederic


Jonathan is a full-time IPython developer who primarily works on the IPython notebook front-end. In his spare time, Jonathan enjoys developing a Python-based video game engine and Poster, an open source HTML5 canvas-based code editor.

Kyle Kelley


Kyle Kelley is a senior software engineer at Netflix, a maintainer on, and a core developer of the IPython/Jupyter project. He wants to help build great environments for collaborative analysis, development, and production workloads for everyone, from small teams to massive scale.

Comments on this page are now closed.


Kyle Kelley
02/17/2015 8:16am PST

Participants in the IPython tutorial can download presentation materials straight from GitHub. Attendees are advised to install Anaconda.

Jennifer Klay
02/17/2015 5:30am PST

Participants in the Intro to NumPy tutorial can download the presentation notebook in advance from GitHub:

See you on Wednesday!

Patrick Dirden
02/16/2015 5:36am PST

Gaurav, you are now enrolled in PyData. See you on Wednesday!

Patrick Dirden
Registration Manager
O’Reilly Media, Inc.

02/16/2015 2:49am PST

During registration, I signed up for the Spark tutorial.
But I want to attend PyData. Is there still room?

Andy Terrel
01/27/2015 12:35pm PST

I expect most presentations will have some hands-on aspects to them. That is the usual case, but perhaps a mix should be expected.

Manne Laukkanen
01/19/2015 8:02pm PST

Seconding Lynn Langit’s question. Hands-on or slideshow?

Lynn Langit
01/02/2015 2:54pm PST

Is this a hands-on session? or lecture?