Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Dask: Flexible Parallelism in Python for Advanced Analytics

Matthew Rocklin (Continuum)
2:55pm3:35pm Wednesday, September 27, 2017
Data science & advanced analytics, Machine Learning
Location: 1A 08/10 Level: Intermediate
Secondary topics:  Pydata

Who is this presentation for?

Data Scientists, algorithm researchers

Prerequisite knowledge

A basic understanding of the Python data science stack (NumPy/Pandas/Scikit-learn) will be assumed. Also a general understanding of common problems that require distributed computing.

What you'll learn

An understanding of how Dask can intuitively scale existing and novel Python advanced analytic workloads to distributed systems.

Description

The data science Python ecosystem (NumPy, Pandas, and Scikit-learn) are efficient and intuitive for advanced analytics workloads. Unfortunately, these tools are restricted to data that fits into memory and runs on a single core. Dask is a parallel computing library that complements the Python ecosystem by providing a distributed parallel framework for high-performance task scheduling.

Dask parallelizes Python libraries like NumPy and Pandas, and integrates with popular machine learning libraries like Scikit-learn, XGBoost, and Tensorflow. This effort, done in collaboration with existing Python development communities, provides a seamless big data experience for Python users for data analysis and complex analytics.

These parallel libraries are all backed by the same flexible task scheduler. This task scheduler is also part of the public API, and is commonly used independently by companies to build complex and reactive distributed systems for bespoke applications that fall outside of the typical use cases for more traditional distributed systems like Spark or Flink.

Matthew Rocklin discusses the basic architecture of Dask, classes of applications in which it is commonly useful, and how it fits into the broader big data ecosystem.

Photo of Matthew Rocklin

Matthew Rocklin

Continuum

Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today leads development of Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Leave a Comment or Question

Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?

Join the conversation here (requires login)