Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Scaling the Python data experience

Wes McKinney (Two Sigma Investments)
1:30pm–2:10pm Thursday, 12/03/2015
Data Science and Advanced Analytics
Location: 321-322 Level: Intermediate
Average rating: 3.60 (5 ratings)

Prerequisite Knowledge

Python data basics


Data scientists and engineers who build data analytics applications use a wide variety of tools, such as Python, R, and Spark. Some frameworks, such as Spark or Oracle’s ORE, offer scalability, but many data science applications are prototyped with Python’s pandas and scikit-learn or R’s analogous packages, which are comparatively scale-limited. Users reach for Python or R because they offer a familiar, feature-rich environment, yet the inability to run and validate prototypes at production scale is a major pain point.

We are developing a new Python framework with an interface that will be familiar to users of pandas and other tools in the Python data science stack. It uses Impala (a high-performance relational query engine for Hadoop) under the hood for execution at scale, and it simplifies interactions with HDFS and other Hadoop components.
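
To make the general idea concrete, here is a minimal sketch of how a pandas-style method chain can be recorded lazily and compiled to SQL that an engine such as Impala could run. It is a toy illustration only, not the API of the framework described in this session: the `LazyTable` class, its methods, and the `trades` table with its columns are all hypothetical names invented for this example.

```python
class LazyTable:
    """Hypothetical deferred table expression (illustration only).

    Each method returns a new object that records the operation;
    nothing is executed until the expression is compiled to SQL.
    """

    def __init__(self, name, filters=(), group_keys=(), aggregates=()):
        self.name = name
        self.filters = tuple(filters)
        self.group_keys = tuple(group_keys)
        self.aggregates = tuple(aggregates)

    def filter(self, condition):
        # Record a row predicate, given here as a raw SQL fragment.
        return LazyTable(self.name, self.filters + (condition,),
                         self.group_keys, self.aggregates)

    def group_by(self, *keys):
        # Record the grouping columns.
        return LazyTable(self.name, self.filters, keys, self.aggregates)

    def sum(self, column):
        # Record a SUM aggregate over one column.
        aggregate = f"SUM({column}) AS {column}_sum"
        return LazyTable(self.name, self.filters, self.group_keys,
                         self.aggregates + (aggregate,))

    def compile(self):
        # Turn the recorded operations into a single SQL string.
        select_list = ", ".join(self.group_keys + self.aggregates) or "*"
        sql = f"SELECT {select_list} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Hypothetical table and column names.
expr = LazyTable("trades").filter("price > 0").group_by("symbol").sum("volume")
query = expr.compile()
print(query)
```

In a design like this, the user writes familiar chained calls while the heavy computation is pushed down to the query engine, so only the (usually much smaller) result needs to come back into local Python memory.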

In this talk, we will give an overview of existing tools for scalable analytics in Python, R, and other data science environments, and demonstrate how the new work relates to the rest of the ecosystem. We’ll also discuss opportunities for future growth in this project and related ones, and ways the community can get involved to help drive scalable open source data analytics forward.

Wes McKinney

Two Sigma Investments

Wes McKinney is a software architect at Two Sigma Investments. He is the creator of Python’s pandas library and a PMC member for Apache Arrow and Apache Parquet. He wrote the book Python for Data Analysis. Previously, Wes worked for Cloudera and was the founder and CEO of DataPad.