Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Loosely coupled data with Apache Arrow Flight

Jacques Nadeau (Dremio)
1:50pm–2:30pm Thursday, March 28, 2019
Average rating: 4.60 (5 ratings)

Who is this presentation for?

  • Data architects, data engineers, BI architects, and data scientists

Level

Intermediate

Prerequisite knowledge

  • Familiarity with SQL, analytics, and cloud services for data storage and compute

What you'll learn

  • Explore Apache Arrow Flight, a new way to exchange and analyze data between systems using an optimal format and libraries for CPU/GPU and RAM efficiency

Description

The number of data tools has skyrocketed in recent years. These tools are all very powerful, but connecting them is often challenging. Connections between them increase processing time, are frequently single streamed, and are often built on legacy interfaces like ODBC, JDBC, and REST. Building a modern infrastructure requires using these tools together, since each part of your organization wants to construct a best-of-breed approach to data science and engineering tools.

Apache Arrow strives to solve part of this problem by allowing these systems to interchange common representations of data through in-process and near-process communications. For distributed and more complex topologies, something better is needed. Enter Arrow Flight.

Arrow Flight is a new initiative within Apache Arrow focused on providing a high-performance protocol and set of libraries for communicating analytical data in large parallel streams. It’s composed of several different implementations and example integrations that allow data engineering organizations to quickly build up data services that can move data between commodity systems at very high speeds.
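To make "large parallel streams" concrete, here is a minimal conceptual sketch in plain Python. It does not use the Arrow libraries, and every name in it (ToyDataService, Endpoint, plan, stream, fetch_parallel) is hypothetical and chosen for illustration only. It assumes a service that first tells the client where the partitions of a dataset live and then lets the client pull each partition as an independent stream of columnar batches.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical stand-ins; none of these names come from the Arrow Flight API.
Batch = Dict[str, List[int]]  # one columnar batch: column name -> values


@dataclass(frozen=True)
class Endpoint:
    """Where one partition of a dataset can be fetched from."""
    node: str
    ticket: str


class ToyDataService:
    """In-memory stand-in for a cluster that serves a dataset in partitions."""

    def __init__(self, partitions: Dict[str, List[Batch]]) -> None:
        self._partitions = partitions

    def plan(self) -> List[Endpoint]:
        # Step 1: the client asks where the data lives; no data moves yet.
        return [Endpoint(node=f"node-{i}", ticket=t)
                for i, t in enumerate(sorted(self._partitions))]

    def stream(self, endpoint: Endpoint) -> Iterator[Batch]:
        # Step 2: each endpoint yields its batches as an independent stream.
        yield from self._partitions[endpoint.ticket]


def fetch_parallel(service: ToyDataService) -> Dict[str, List[int]]:
    """Pull every stream concurrently and concatenate the columns."""
    endpoints = service.plan()
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        # map() preserves endpoint order, so the result is deterministic.
        streams = list(pool.map(lambda e: list(service.stream(e)), endpoints))
    merged: Dict[str, List[int]] = {}
    for batches in streams:
        for batch in batches:
            for column, values in batch.items():
                merged.setdefault(column, []).extend(values)
    return merged


service = ToyDataService({
    "part-0": [{"id": [1, 2], "score": [10, 20]}],
    "part-1": [{"id": [3], "score": [30]}, {"id": [4], "score": [40]}],
})
result = fetch_parallel(service)
print(result)  # {'id': [1, 2, 3, 4], 'score': [10, 20, 30, 40]}
```

In this toy model the data stays columnar from the service to the client, and the number of concurrent streams scales with the number of partitions rather than being fixed at one, which is the contrast the description draws with single-streamed interfaces.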

Jacques Nadeau walks you through the components of Arrow Flight, covering the types of operations available within Arrow Flight as well as how these operations can be used for different use cases. He then shares several examples where Arrow Flight is already implemented to provide better integration and performance. Along the way, Jacques also reviews operational considerations, including benchmarking performance and how collaborative backpressure, QoS, stream management, and security are implemented within Arrow Flight, and shares a small example application, with code, that highlights the strengths and capabilities of Arrow Flight. He concludes with a discussion of where Arrow Flight is going, opportunities for growth, and how it fits into the concept of data microservices.

Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.