Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Scaling Python analytics on Impala

Wes McKinney (Two Sigma Investments)
11:20am–12:00pm Wednesday, 09/30/2015
Data Science & Advanced Analytics
Location: 1 E8 / 1 E9
Level: Intermediate
Average rating: 3.70 (10 ratings)

Data scientists and engineers who create data analytics applications use a wide variety of tools, such as Python, R, and Spark. Some frameworks, such as Spark or Oracle’s ORE, offer scalability, but many data science applications are prototyped using Python’s pandas and scikit-learn or R’s analogous packages, which are comparatively limited in scale. Users reach for Python or R because they offer a familiar and feature-rich environment, yet the inability to run and validate prototypes at production scale is a major pain point.

We are developing a new Python framework whose interface will be familiar to users of pandas and other tools in the Python data science stack. Under the hood, it uses Impala (a high-performance relational query engine for Hadoop) for execution at scale.

In the talk, we will give an overview of existing tools for scalable analytics in Python, R, and other data science environments, and demonstrate how the new work relates to the rest of the ecosystem. We’ll discuss opportunities for future growth in the project (and others related to it), and ways that the community can get involved to help drive scalable open source data analytics forward.

Wes McKinney

Two Sigma Investments

Wes McKinney is a software architect at Two Sigma Investments. He is the creator of Python’s pandas library and a PMC member for Apache Arrow and Apache Parquet. He wrote the book Python for Data Analysis. Previously, Wes worked for Cloudera and was the founder and CEO of DataPad.