Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Setting up a lightweight distributed caching layer using Apache Arrow

Jacques Nadeau (Dremio)
4:35pm–5:15pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 10
Level: Intermediate
Average rating: *****
(5.00, 1 rating)

Who is this presentation for?

  • Data architects and data engineers

Prerequisite knowledge

  • A basic understanding of caching architecture, including distributed caching

What you'll learn

  • Explore Apache Arrow, from its design and architecture to using it in applications
  • Learn how to use Arrow to achieve various objectives for performance, governance, and access

Description

Apache Arrow has quickly become the standard for high-performance in-memory processing. It is integrated into major open source projects such as Spark, pandas, Parquet, Dremio, and libgdf, as well as the GPU Open Analytics Initiative (GOAI). All of these projects have adopted Arrow as the go-to representation for data processing and interchange, which has substantially improved how well systems can share and process data. However, systems today generate Arrow data only ephemerally, and the repeated translation from on-disk formats to Arrow can diminish the overall performance potential.

Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow data directly using the Arrow RPC and IPC protocols. Jacques offers an overview of the system design and deployment architecture, focusing on the cache lifecycle, update patterns, cache cohesion, and appropriate use cases. He then explores in detail how data science, analytical, and custom applications can all use the cache simultaneously, as well as the trade-offs around in-memory representations, data size, and balancing working memory with cache overhead. Along the way, Jacques discusses security, upgrades, and versioning, focusing on how to balance performance, access, and governance. Jacques concludes with a demonstration of the cache in action, showing its impact on overall performance and, ultimately, end-user satisfaction.

Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.