Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Pandas on Ray

Who is this presentation for?

  • Pandas users and data scientists

Prerequisite knowledge

  • Familiarity with Python and the pandas API

What you'll learn

  • Learn how to use Pandas on Ray to improve the performance of pandas on large datasets with the modification of one line of code

Description

Large-scale data science has traditionally been left to distributed computing experts, or at least those familiar with the concepts. Most designers of distributed systems give users knobs to tune and expose a significant amount of system configuration. Thus, the trade-off for incredible system performance is a significantly steeper learning curve and a requirement for domain knowledge. Most existing users just want pandas to run faster and aren’t looking to optimize their workflows for their particular hardware setup or data. Ideally, a user could use the same pandas script for a 10 KB dataset as a 10 TB dataset and have it run just as quickly with enough hardware resources. The Pandas on Ray project was created to accomplish these goals.

Pandas on Ray is an early stage DataFrame library built on top of the Ray distributed execution framework that wraps pandas and transparently distributes the data and computation. Users don’t need to know how many cores their system or cluster has or specify how to distribute the data. In fact, users can continue using their previous pandas notebooks while experiencing a considerable speedup compared to native pandas, even on a single machine. Only a modification of the import statement is needed: changing “import pandas as pd” to “import ray.dataframes as pd.” Once you’ve changed your import statement, you are ready to use pandas on Ray just like you would pandas.

Pandas on Ray is targeted at existing pandas users who are looking to improve performance and see faster runtimes without having to switch to another API and learn another framework, so that they can run their programs faster and at scale without any code changes. The development team is aggressively working to achieve functional parity with the full pandas API and have so far implemented a significant subset of the API. Preliminary results show that Pandas on Ray is 3–4x faster than pandas and Dask on a single operation on a single machine with eight cores. The goal is that pandas users can run pandas in a cloud setting without cumbersome configuration or much additional effort.