Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Scaling Python data analysis

Matthew Rocklin (Anaconda), Ben Zaitlen (Anaconda)
9:00am–12:30pm Tuesday, September 26, 2017
Data science & advanced analytics
Location: 1E 15/16 Level: Intermediate

Who is this presentation for?

  • Developers, data scientists, and researchers

Prerequisite knowledge

  • Familiarity with Python and the Jupyter Notebook
  • Basic experience with pandas and scikit-learn (useful but not required)

Materials or downloads needed in advance

  • A laptop (must be able to reach external websites and view ports 8888 and 8787) with Anaconda installed
  • Follow the install instructions on the course GitHub repository

What you'll learn

  • Gain hands-on experience with a variety of tools including the standard library, Spark, and Dask
  • Learn how to think about parallel data analysis and how to choose the right tool for the job

Description

The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive for data scientists. However, it was designed for data that fits in memory and, for the most part, runs on only a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters using a variety of tools, including the standard library, Spark, and Dask.

Using guided exercises in Jupyter notebooks, you’ll gain hands-on experience with parallel computing tools so you understand how to choose the right tool for the job.

Outline

Programming paradigms on a single machine

  • Data ingestion with embarrassingly parallel map
  • Flexible algorithms with futures
  • Big data collections
  • Hands-on exercise
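
As a rough illustration of the first two paradigms listed above, the following is a minimal sketch using only Python's standard library (concurrent.futures). It is not taken from the tutorial materials: the functions parse_record, parallel_ingest, and total_with_futures and the sample data are hypothetical names chosen for this example. A thread pool is used for simplicity; for CPU-bound work, ProcessPoolExecutor offers the same interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_record(line):
    """Hypothetical ingestion step: turn 'name,value' text into a tuple."""
    name, value = line.split(",")
    return name, int(value)


def parallel_ingest(lines, max_workers=4):
    """Embarrassingly parallel map: each line is parsed independently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(parse_record, lines))


def total_with_futures(records, max_workers=4):
    """Futures: submit tasks one by one and combine results as they finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each submit() call returns a Future immediately.
        futures = [pool.submit(lambda rec: rec[1] * 2, r) for r in records]
        # as_completed yields futures in completion order, not submit order.
        return sum(f.result() for f in as_completed(futures))


if __name__ == "__main__":
    sample_lines = ["a,1", "b,2", "c,3"]  # made-up sample data
    records = parallel_ingest(sample_lines)
    print(records)
    print(total_with_futures(records))
```

The map form suits work where every input is processed the same way and independently; the futures form trades that simplicity for the freedom to submit tasks at any time and react to results as they arrive.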

Applying lessons to a cluster

  • Replay previous exercise on a cluster
  • The costs of communication and how performance characteristics can motivate different algorithms
  • Hands-on exercise

Wrap-up and Q&A

Cloud resources generously donated by Google.

Matthew Rocklin

Anaconda

Matthew Rocklin is an open source software developer at Anaconda focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Ben Zaitlen

Anaconda

Ben Zaitlen is a data scientist and developer at Anaconda. He has several years of experience with Python and is passionate about any and all forms of data. Currently, he spends his time thinking about usability of large data systems and infrastructure problems as they relate to data management and analysis.

Comments

Mohammed Ayub | DATA SCIENTIST
09/26/2017 7:29am EDT

Hi Matthew,
Unfortunately, I will miss this session because it conflicts with another tutorial. Could the tutorial materials be made available for reference to Gold Pass members?

Matthew Rocklin | COMPUTATIONAL SCIENTIST
09/21/2017 12:09pm EDT

That script queries the Google Finance API, which seems to have shifted. See relevant pandas-datareader issue here: https://github.com/pydata/pandas-datareader/issues/391

We’re working around this. I recommend trying again on Monday.

Reema Saha | DATA WAREHOUSE ARCHITECT
09/21/2017 11:26am EDT

I’ve downloaded and installed the Anaconda application and am trying to get the dataset ready. I’m running into issues when trying to run python prep.py