In pursuit of speed and efficiency, big data processing continues its logical evolution toward columnar execution. A number of key big data technologies, including Kudu, Ibis, Drill, and many others, have or will soon have in-memory columnar capabilities. Jacques Nadeau, vice president of Apache Arrow, and Julien Le Dem, vice president of Apache Parquet, discuss the future of columnar data processing and the hardware trends it can take advantage of.
For interoperability, row-based encodings (CSV, Thrift, Avro) combined with general-purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. By contrast, modern CPUs achieve higher throughput by applying SIMD instructions and vectorization to Apache Arrow’s columnar in-memory representation. Similarly, Apache Parquet provides columnar data access optimized for storage and I/O, using statistics and appropriate encodings. Together, Arrow and Parquet form a solid foundation for a shared columnar representation across the big data ecosystem, which promises great things for the future.
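To make the role of statistics concrete, here is a minimal sketch of how per-chunk minimum and maximum values let a scan skip data it does not need to read. It uses only the Python standard library and is not Parquet's actual format or API; `ColumnChunk`, `make_chunks`, and `scan_greater_than` are hypothetical names chosen for illustration.

```python
from array import array
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for one chunk of a column plus its statistics."""
    values: array
    min_value: int
    max_value: int


def make_chunks(values, chunk_size):
    """Split a column into fixed-size chunks and record min/max per chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = array("q", values[start:start + chunk_size])
        chunks.append(ColumnChunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return the matching values and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # the statistics prove that nothing in this chunk matches
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


column = list(range(1000))  # sorted values, so the statistics are very selective
chunks = make_chunks(column, chunk_size=100)
matches, chunks_read = scan_greater_than(chunks, threshold=949)
print(len(matches), "matches from", chunks_read, "of", len(chunks), "chunks")
```

In this example the scan reads one chunk out of ten. How much can be skipped in practice depends on how the data is sorted and distributed, so the benefit on unsorted data may be much smaller.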
The Apache Arrow and Apache Parquet projects define standard columnar representations, allowing interoperability without the usual cost of serialization. Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables these tools to be used together seamlessly and efficiently. When tools are colocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When they are remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs, and RDMA will soon allow data to be exposed remotely.
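The scatter-gather idea can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions, not Arrow's IPC format: it assumes a Unix-like system (for `socket.socketpair` and `socket.sendmsg`), both ends share the same byte order, and the tiny length header, `send_columns`, and `recv_columns` are hypothetical. The point it shows is that the column buffers are handed to the socket as they sit in memory, without first being copied into one serialized message.

```python
import socket
import struct
from array import array

# Two column buffers that already sit in memory in their wire layout.
ids = array("q", [1, 2, 3, 4])               # 64-bit integers
prices = array("d", [9.5, 20.0, 3.25, 7.0])  # 64-bit floats


def send_columns(sock, columns):
    """Send a small length header plus the raw column buffers in one call."""
    views = [memoryview(col).cast("B") for col in columns]
    header = struct.pack(f"<I{len(views)}Q", len(views),
                         *(v.nbytes for v in views))
    # sendmsg gathers the buffers in place; no combined copy is built here.
    # A real sender would also have to handle partial sends.
    return sock.sendmsg([header, *views])


def recv_exact(sock, size):
    """Read exactly `size` bytes from the socket."""
    buf = bytearray()
    while len(buf) < size:
        part = sock.recv(size - len(buf))
        if not part:
            raise ConnectionError("socket closed early")
        buf.extend(part)
    return bytes(buf)


def recv_columns(sock, typecodes):
    """Rebuild the columns from the header and the raw buffers."""
    (count,) = struct.unpack("<I", recv_exact(sock, 4))
    sizes = struct.unpack(f"<{count}Q", recv_exact(sock, 8 * count))
    columns = []
    for typecode, size in zip(typecodes, sizes):
        col = array(typecode)
        col.frombytes(recv_exact(sock, size))
        columns.append(col)
    return columns


left, right = socket.socketpair()
with left, right:
    sent = send_columns(left, [ids, prices])
    received_ids, received_prices = recv_columns(right, "qd")

print(received_ids.tolist(), received_prices.tolist())
```

The receiver here copies the bytes into new arrays for simplicity. A shared-memory or RDMA transport, as described above, would aim to avoid even that copy.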
As in-memory processing becomes more popular, the traditional tiering of RAM as working space and HDD as persistent storage is becoming outdated. More tiers are now available, such as SSDs and nonvolatile memory, which provide much higher data density and achieve latency close to that of RAM at a fraction of the cost. Execution engines can take advantage of more granular tiering and avoid the traditional spilling to disk, which degrades performance by an order of magnitude when the working dataset does not fit in main memory.
Julien Le Dem is a principal engineer at WeWork. He’s also the coauthor of Apache Parquet and the PMC chair of the project, and he’s a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.
Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.