Presented By O’Reilly and Cloudera

Make Data Work

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data science across data sources with Apache Arrow

Tomer Shiran (Dremio)

14:55–15:35 Wednesday, 23 May 2018

Big data and data science in the cloud, Data engineering and architecture
Location: S11A Level: Intermediate

Who is this presentation for?

Data scientists, data engineers, and those working in virtualization

Prerequisite knowledge

A basic understanding of SQL
Familiarity with Python (e.g., Jupyter) or R (useful but not required)

What you'll learn

Learn how to use Apache Arrow to draw on data from disparate sources in your data science work
Understand the role that Arrow will play in a future where data is increasingly distributed and heterogeneous

Description

As companies continue to embrace modern architectures based on microservices and cloud applications, it has become increasingly difficult to physically consolidate all data into a single system. In a world where data is extremely fragmented and users expect instant gratification, the age-old approach of constructing and maintaining ETL pipelines can be prohibitively cumbersome and expensive.

Apache Arrow is an open source project, initiated by over a dozen open source communities, that provides a standard columnar in-memory data representation and processing framework. Arrow has emerged as a popular way to handle in-memory data for analytical purposes. In the last year, Arrow has been embedded into a broad range of open source (and commercial) technologies, including GPU databases, machine learning libraries and tools, execution engines, and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, pandas, R, and Spark).
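
To make "columnar in-memory representation" concrete, here is a minimal sketch that uses only the Python standard library. It is not Arrow's API or its actual memory format. The field names (user_id, latency_ms) and the helper to_columns are hypothetical, chosen purely for illustration. The sketch pivots row records into typed, contiguous column buffers, scans a single column, and exposes that column's bytes without copying them.

```python
from array import array

# Row-oriented layout: one record (dict) per row. Hypothetical sample data.
rows = [
    {"user_id": 1, "latency_ms": 12.5},
    {"user_id": 2, "latency_ms": 7.0},
    {"user_id": 3, "latency_ms": 20.5},
]


def to_columns(records):
    """Pivot row records into typed, contiguous column buffers."""
    return {
        # "q" = signed 64-bit integers, "d" = 64-bit floats
        "user_id": array("q", (r["user_id"] for r in records)),
        "latency_ms": array("d", (r["latency_ms"] for r in records)),
    }


columns = to_columns(rows)

# An analytical scan reads only the column it needs, stored contiguously.
latencies = columns["latency_ms"]
mean_latency = sum(latencies) / len(latencies)

# A memoryview exposes the column's underlying buffer without copying it.
# This only loosely mirrors the idea of sharing one buffer between tools.
view = memoryview(latencies)
print(mean_latency, view.nbytes)
```

In this toy layout each column is a single typed buffer, so an aggregate over latency_ms never touches user_id. Arrow additionally standardizes the layout (including metadata and null handling) so that different tools can agree on it.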

Tomer Shiran offers an overview of Arrow and shows how companies can use it to let users access and analyze data across disparate data sources without physically consolidating it into a centralized data repository. He also explains how several open source projects are using Arrow to achieve high-performance data processing and interoperability across systems. Along the way, Tomer shares examples such as a 50x speedup in PySpark (Spark-pandas interoperability) and a join between Parquet files on S3, Oracle tables, and Elasticsearch indices. Tomer concludes by outlining Apache Arrow’s 12-month roadmap.

Tomer Shiran

Dremio

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Before that, he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.
