Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

The future of column-oriented data processing with Arrow and Parquet

Julien Le Dem (WeWork), Jacques Nadeau (Dremio)
2:55pm–3:35pm Wednesday, 09/28/2016
Data innovations
Location: 1 E 07/1 E 08
Level: Advanced
Tags: real-time
Average rating: 4.33 (6 ratings)

Prerequisite knowledge

  • A basic understanding of the Hadoop ecosystem and query engines

What you'll learn

  • Understand why Apache Parquet and Arrow matter and what their roles in the evolution of the big data ecosystem are
  • Explore the hardware trends that will benefit Parquet and Arrow in the future

Description

    In pursuit of speed and efficiency, big data processing is continuing its logical evolution toward columnar execution. A number of key big data technologies, including Kudu, Ibis, and Drill, have, or will soon have, in-memory columnar capabilities. The solid foundation laid by Apache Arrow and Apache Parquet for a shared columnar representation across the ecosystem promises a great future. Modern CPUs achieve higher throughput by applying SIMD instructions and vectorization to Arrow’s columnar in-memory representation. Similarly, Parquet provides columnar data access optimized for storage and I/O, using statistics and appropriate encodings. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with general-purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. The Arrow and Parquet projects define standard columnar representations that allow interoperability without the usual cost of serialization.
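
    As a rough illustration of the ideas in this paragraph, the following Python sketch turns row data into per-column buffers, applies dictionary and run-length encoding to a repetitive column, and uses per-chunk min/max statistics to skip data during a scan. It uses only the standard library and is not the actual Parquet or Arrow format; the sample rows, the function names, and the chunk size are all assumptions made for the example.

```python
from array import array

# Hypothetical sample rows (user_id, country, latency_ms); not from the talk.
rows = [
    (1, "US", 120),
    (2, "US", 95),
    (3, "FR", 210),
    (4, "FR", 180),
    (5, "FR", 175),
    (6, "JP", 60),
]

# Row layout -> column layout: each column is stored on its own,
# numeric columns as contiguous typed buffers.
user_ids = array("q", (r[0] for r in rows))
countries = [r[1] for r in rows]
latencies = array("q", (r[2] for r in rows))


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary = []
    index = {}
    codes = array("i")
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive equal codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs


def chunk_stats(column, chunk_size):
    """Record (start, min, max) per chunk so a reader can skip chunks."""
    return [
        (i, min(column[i:i + chunk_size]), max(column[i:i + chunk_size]))
        for i in range(0, len(column), chunk_size)
    ]


def scan_greater_than(column, stats, chunk_size, threshold):
    """Return matching values, reading only chunks whose max exceeds threshold."""
    matches, chunks_read = [], 0
    for start, _lo, hi in stats:
        if hi <= threshold:
            continue  # statistics prove no value in this chunk can match
        chunks_read += 1
        matches.extend(v for v in column[start:start + chunk_size] if v > threshold)
    return matches, chunks_read


dictionary, codes = dictionary_encode(countries)
runs = run_length_encode(codes)
stats = chunk_stats(latencies, chunk_size=2)
matches, chunks_read = scan_greater_than(latencies, stats, 2, 150)
```

    In this toy dataset the country column shrinks to three dictionary entries and three runs, and the scan for latencies above 150 reads two of the three chunks. Real columnar formats apply the same ideas at much larger scale and with more elaborate encodings.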

    Jacques Nadeau, vice president of Apache Arrow, and Julien Le Dem, vice president of Apache Parquet, discuss the future of columnar data processing and the hardware trends it takes advantage of. Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables them to be used together seamlessly and efficiently: when collocated on the same processing node, read-only shared memory and IPC avoid communication overhead; when remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs, and soon RDMA will allow exposing data remotely. As in-memory processing becomes more popular, the traditional tiering of RAM as working space and HDD as persistent storage is becoming outdated. More tiers, such as SSDs and nonvolatile memory, are now available; they provide much higher data density and achieve a latency close to that of RAM at a fraction of the cost. Execution engines can take advantage of this more granular tiering and avoid the traditional spilling to disk, which degrades performance by an order of magnitude when the working dataset does not fit in main memory.
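
    The scatter-gather idea mentioned above can be sketched with Python's standard socket module: several column buffers are handed to the kernel in a single call without first being copied into one serialized message, and the receiver reads straight into preallocated typed buffers. This is a minimal sketch, not Arrow's IPC protocol; it assumes a Unix-like system (where `sendmsg` and `recvmsg_into` are available), a local socket pair standing in for a network connection, and hypothetical column data.

```python
import socket
from array import array

# Hypothetical column buffers; in a columnar engine these would be the
# in-memory column vectors themselves.
ids = array("q", [1, 2, 3, 4])
scores = array("d", [0.5, 1.5, 2.5, 3.5])

# A connected local socket pair stands in for a network connection.
sender, receiver = socket.socketpair()

# Gather: pass both buffers in one call; no concatenated copy is built.
out_buffers = [memoryview(ids).cast("B"), memoryview(scores).cast("B")]
expected = sum(len(b) for b in out_buffers)
# This sketch assumes the small payload is sent in one call; production
# code would loop to handle partial sends.
sent = sender.sendmsg(out_buffers)

# Scatter: receive directly into preallocated typed buffers.
ids_in = array("q", [0] * len(ids))
scores_in = array("d", [0.0] * len(scores))
in_buffers = [memoryview(ids_in).cast("B"), memoryview(scores_in).cast("B")]
nbytes, _ancdata, _flags, _addr = receiver.recvmsg_into(in_buffers)

sender.close()
receiver.close()
```

    Because the bytes on the wire are the same as the bytes in memory, the receiver can use `ids_in` and `scores_in` immediately, with no decoding step. Both ends must already agree on the column types and lengths, which is the role a shared format specification plays.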

    Julien Le Dem

    WeWork

    Julien Le Dem is a principal engineer at WeWork. He’s also the coauthor of Apache Parquet and the PMC chair of the project, and he’s a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

    Jacques Nadeau

    Dremio

    Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

    Comments

    Julien Le Dem
    09/30/2016 10:07am EDT

    Slides: http://www.slideshare.net/julienledem/strata-ny-2016-the-future-of-columnoriented-data-processing-with-arrow-and-parquet

    09/29/2016 7:38am EDT

    are slides available?