Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Scaling Python Data Analysis

Matthew Rocklin (Continuum), Ben Zaitlen (Continuum Analytics)
9:00am–12:30pm Tuesday, September 26, 2017
Data science & advanced analytics
Location: 1A 23/24
Level: Intermediate

Who is this presentation for?

Developers, Data Scientists, Researchers

Prerequisite knowledge

Familiarity with Python and Jupyter notebooks is assumed. Some basic experience with Pandas and Scikit-Learn will be helpful.

Materials or downloads needed in advance

Students should bring a laptop. Those who want to work on their own computer for the first half (recommended) should install Anaconda or install the necessary libraries themselves. Alternatively, we will provide a cloud-hosted solution. The laptop should be able to reach external websites and connect to ports 8888 and 8787.

What you'll learn

Gain hands-on experience with a variety of tools, including the standard library, Spark, and Dask, along with a general understanding of how to think about parallel data analysis and how to choose the right tool for the job.

Description

The Python data science stack, including NumPy, Pandas, Scikit-Learn, and other libraries, is efficient and intuitive for data scientists. However, it was designed for data that fits in memory and for execution on a single core. This tutorial teaches you to parallelize and scale your Python data science workloads to multi-core machines and multi-machine clusters. We cover a variety of tools, including the standard library, Spark, and Dask. This comparative approach will help students understand how to think broadly about parallel applications and how to choose the right tool for the job.

In this hands-on tutorial, students will work through guided exercises presented as Jupyter notebooks. They will start on their own computers for the first half of the tutorial and switch to a cloud-hosted cluster for the second half to gain hands-on experience with a distributed system (we will provide all setup).

Students will walk away from this comparative tutorial with hands-on experience with a few parallel computing tools and an understanding of how to choose the right tool for the job.

Outline

Basic introduction, outline, and background

Part One: Programming paradigms on a single machine
1. Data ingestion with embarrassingly parallel map
2. Flexible algorithms with futures
3. Big Data collections
4. Capstone exercise with machine learning
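
As a rough illustration of the first two items above, the sketch below uses only the standard library's concurrent.futures module. It is not taken from the tutorial materials: the parse_record function, the sample lines, and the worker count are hypothetical and chosen only to show the shape of a parallel map and of futures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_record(line):
    """Hypothetical ingestion step: turn 'name,value' into a (name, int) pair."""
    name, value = line.split(",")
    return name.strip(), int(value)


def ingest_with_map(lines, executor):
    # Embarrassingly parallel map: every line is parsed independently,
    # and results come back in input order.
    return list(executor.map(parse_record, lines))


def ingest_with_futures(lines, executor):
    # Futures: submit tasks one at a time and handle each result as soon
    # as it finishes, which leaves room for less regular algorithms.
    futures = [executor.submit(parse_record, line) for line in lines]
    totals = {}
    for future in as_completed(futures):
        name, value = future.result()
        totals[name] = totals.get(name, 0) + value
    return totals


# Hypothetical sample input.
lines = ["a, 1", "b, 2", "a, 3", "c, 4"]
with ThreadPoolExecutor(max_workers=4) as executor:
    records = ingest_with_map(lines, executor)
    totals = ingest_with_futures(lines, executor)
```

Threads keep the sketch simple. For CPU-bound work on multiple cores, ProcessPoolExecutor from the same module offers the same map and submit interface.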

Part Two: Applying lessons to a cluster
1. Replay the previous exercise on a cluster, and see that the programming techniques from Part One carry over to distributed systems.
2. Learn about the costs of communication and how performance characteristics can motivate different algorithms.
3. Branch out to one of a few domain-specific exercises, experimenting with previous lessons.
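
One way to picture how communication costs can motivate a different algorithm is computing a mean over data split across workers. The sketch below is a plain-Python illustration, not code from the tutorial: the partitions, the function names, and the "values moved" counts are assumptions used to contrast shipping raw data with shipping small partial results.

```python
def partial_stats(partition):
    # Runs where the data lives; only two numbers leave the worker.
    return sum(partition), len(partition)


def combine(stats):
    # Runs on the coordinator; input is one small pair per partition.
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count


# Hypothetical data split across three workers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Naive plan: ship every element to one place, then compute.
values_moved_naive = sum(len(p) for p in partitions)
naive_mean = sum(x for p in partitions for x in p) / values_moved_naive

# Communication-aware plan: ship one (sum, count) pair per partition.
stats = [partial_stats(p) for p in partitions]
values_moved_aware = 2 * len(stats)
aware_mean = combine(stats)
```

Both plans give the same answer, but the second moves a fixed two values per partition no matter how large each partition grows.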

Conclusion. Recap of lessons learned and links for more information.

Cloud resources generously donated by Google.

https://github.com/mrocklin/parallel-data-analysis


Matthew Rocklin

Continuum

Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.


Ben Zaitlen

Continuum Analytics

Ben is a data scientist and developer at Continuum Analytics. He has several years of experience with Python and is passionate about any and all forms of data. Currently he spends his time thinking about usability of large data systems and infrastructure problems as they relate to data management and analysis.
