Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Creating a virtual data lake with Apache Arrow

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
14:55–15:35 Thursday, 25 May 2017
Data engineering and architecture
Location: Capital Suite 10/11
Level: Intermediate
Average rating: 4.75 (4 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and architects

Prerequisite knowledge

A basic understanding of at least one programming language (useful but not required)

What you'll learn

  • Understand the architecture and benefits of Apache Arrow, an open source in-memory columnar technology driven by over a dozen companies and open source projects
  • Learn how to extend a data lake beyond Hadoop so that data can be joined in real time across disparate data sources

Description

Organizations are increasingly adopting modern data stores (Hadoop, NoSQL, etc.) and public cloud infrastructure for new applications. This introduces many challenges for data consumers such as business analysts and data scientists. For example, standard BI tools don’t work with NoSQL databases (because those databases don’t support SQL), and the tools’ performance on large Hadoop and cloud storage datasets is often prohibitively slow.

As a result, IT and data engineering often resort to ETLing the data from these systems into a relational data warehouse. This process is complex and expensive and introduces significant delays in data availability.

Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that makes it possible to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R. Tomer and Jacques outline the architecture of an Arrow-based solution for querying data from disparate data sources before highlighting best practices for accelerating queries, such as in-memory caching and pre-aggregation. They conclude with a live demo exploring and analyzing data across HDFS, S3, MongoDB, and Elasticsearch using popular client applications, including Tableau, Excel, Python, and R.
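To make the ideas of a columnar in-memory layout and pre-aggregation more concrete, here is a minimal sketch in plain Python. It does not use the Apache Arrow library and is not taken from the talk. The two "sources", the column names, and the helper functions (`to_columns`, `pre_aggregate`, `join_totals`) are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records from two separate sources (all names are made up).
orders_rows = [  # e.g. rows as a document store might return them
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 7.5},
]
customers_rows = [  # e.g. rows read from a file in object storage
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "FR"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column (columnar layout)."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def pre_aggregate(columns, key, value):
    """Sum `value` per distinct `key`, scanning only the two columns needed."""
    totals = defaultdict(float)
    for k, v in zip(columns[key], columns[value]):
        totals[k] += v
    return dict(totals)


def join_totals(totals, columns, key, attribute):
    """Roll pre-aggregated totals up by an attribute held in a second source."""
    lookup = dict(zip(columns[key], columns[attribute]))
    result = defaultdict(float)
    for k, total in totals.items():
        if k in lookup:
            result[lookup[k]] += total
    return dict(result)


orders = to_columns(orders_rows)
customers = to_columns(customers_rows)
totals_by_customer = pre_aggregate(orders, "customer_id", "amount")
totals_by_country = join_totals(
    totals_by_customer, customers, "customer_id", "country"
)
print(totals_by_country)
```

The point of the sketch is only that once both sources share one column-oriented representation, an aggregate can be computed once over the relevant columns and then reused across a join, rather than re-scanning every row for each query.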


Tomer Shiran

Dremio

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development. As a member of the executive team, he helped grow the company from five employees to over 300 employees and 700 enterprise customers. Before that, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.


Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.