Data scientists and engineers who build data analytics applications use a wide variety of tools, such as Python, R, and Spark. Some frameworks, such as Spark or Oracle’s ORE, offer scalability, but many data science applications are prototyped with Python’s pandas and scikit-learn or R’s analogous packages, which are comparatively scale-limited. Users reach for Python or R because they offer a familiar, feature-rich environment, but the inability to run and validate prototypes at production scale is a major pain point.
We are developing a new Python framework with an interface that will feel familiar to users of pandas and other tools in the Python data science stack. Under the hood, it uses Impala (a high-performance relational query engine for Hadoop) to execute at scale.
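To make the idea of a pandas-like interface backed by a SQL engine concrete, here is a minimal sketch of one way such a layer could work: method calls are recorded lazily and compiled into a single SQL query that an engine such as Impala could run. This is an illustration only, not the framework's actual API. The names `LazyTable`, `filter`, `groupby`, `agg`, and `to_sql`, along with the `sales` table and its columns, are hypothetical, and the step that sends the query to Impala is left out.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical handle that records operations instead of executing them."""
    name: str
    filters: Tuple[str, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[Tuple[str, str, str], ...] = ()  # (alias, function, column)

    def filter(self, predicate: str) -> "LazyTable":
        # Return a new object; nothing is computed yet.
        return replace(self, filters=self.filters + (predicate,))

    def groupby(self, *keys: str) -> "LazyTable":
        return replace(self, group_keys=self.group_keys + keys)

    def agg(self, **named: Tuple[str, str]) -> "LazyTable":
        # Each keyword is alias=(function, column), e.g. total=("sum", "amount").
        new = tuple((alias, func, col) for alias, (func, col) in named.items())
        return replace(self, aggregates=self.aggregates + new)

    def to_sql(self) -> str:
        # Compile the recorded operations into one SQL statement.
        items = list(self.group_keys) + [
            f"{func.upper()}({col}) AS {alias}" for alias, func, col in self.aggregates
        ]
        select_list = ", ".join(items) if items else "*"
        sql = f"SELECT {select_list} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Usage: build the query with chained calls, then inspect the generated SQL.
query = (
    LazyTable("sales")
    .filter("year = 2015")
    .groupby("region")
    .agg(total=("sum", "amount"))
)
print(query.to_sql())
```

Because the expression is only a description of the work, the data never has to fit in the Python process; in a design like this, only the final (usually small) result would be pulled back into a local structure such as a pandas DataFrame.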
In this talk, we will give an overview of existing tools for scalable analytics in Python, R, and other data science environments, and demonstrate how the new work relates to the rest of the ecosystem. We’ll discuss opportunities for future growth in this project and related ones, and ways the community can get involved to help drive scalable open source data analytics forward.
Wes McKinney is a software architect at Two Sigma Investments. He is the creator of Python’s pandas library and a PMC member for Apache Arrow and Apache Parquet. He wrote the book Python for Data Analysis. Previously, Wes worked for Cloudera and was the founder and CEO of DataPad.
©2015, O'Reilly Media, Inc.