Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data science across data sources with Apache Arrow

Tomer Shiran (Dremio)
14:55–15:35 Wednesday, 23 May 2018
Average rating: ***..
(3.50, 2 ratings)

Who is this presentation for?

  • Data scientists, data engineers, and those working in virtualization

Prerequisite knowledge

  • A basic understanding of SQL
  • Familiarity with Python (e.g., Jupyter) or R (useful but not required)

What you'll learn

  • Learn how to use Apache Arrow to utilize data from disparate data sources in your data science work
  • Understand the role that Arrow will play in a future where data is increasingly distributed and heterogeneous

Description

As companies continue to embrace modern architectures based on microservices and cloud applications, it has become increasingly difficult to physically consolidate all data into a single system. In a world where data is extremely fragmented and users expect instant gratification, the age-old approach of constructing and maintaining ETL pipelines can be prohibitively cumbersome and expensive.

Apache Arrow is an open source project, initiated by over a dozen open source communities, that provides a standard columnar in-memory data representation and processing framework. Arrow has emerged as a popular way to handle in-memory data for analytical purposes. In the last year, Arrow has been embedded into a broad range of open source (and commercial) technologies, including GPU databases, machine learning libraries and tools, execution engines, and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, pandas, R, and Spark).
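
To make the idea of a columnar in-memory representation concrete, here is a minimal sketch that uses only the Python standard library. It does not use Arrow's own API, and it is not how Arrow is implemented; the helper names (to_columnar, column_mean) and the sample records are invented for illustration. It only shows the general shape of the idea: values of one field are packed into a single contiguous buffer, with a separate validity marker per slot standing in for the validity bitmap Arrow uses for nulls.

```python
from array import array

# Row-oriented layout: one dict per record (hypothetical sample data).
rows = [
    {"city": "London", "temp_c": 14.5},
    {"city": "Paris", "temp_c": 17.0},
    {"city": "Berlin", "temp_c": None},  # missing value
]


def to_columnar(records, name):
    """Pack one numeric field into a contiguous buffer plus validity flags."""
    values = array("d")  # contiguous 64-bit floats
    valid = []           # True where a value is present
    for record in records:
        value = record[name]
        valid.append(value is not None)
        # Null slots still occupy space; the flag marks them as invalid.
        values.append(0.0 if value is None else value)
    return values, valid


def column_mean(values, valid):
    """Average the valid slots by scanning a single column buffer."""
    present = [v for v, ok in zip(values, valid) if ok]
    return sum(present) / len(present) if present else None


temps, temps_valid = to_columnar(rows, "temp_c")
mean_temp = column_mean(temps, temps_valid)
print(mean_temp)
```

An analytical query such as an average touches only the one column it needs rather than every field of every record, which is the general motivation for columnar layouts in analytics.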

Tomer Shiran offers an overview of Arrow, shows how companies can utilize Arrow to enable users to access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository, and explains how several open source projects are utilizing it to achieve high-performance data processing and interoperability across systems. Along the way, Tomer shares examples such as a 50x speedup in PySpark (Spark-pandas interoperability) and a join between Parquet files on S3, Oracle tables, and Elasticsearch indices. Tomer concludes by outlining Apache Arrow’s 12-month roadmap.


Tomer Shiran

Dremio

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. He also held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.