Hive on Apache Tez: Benchmarked at Yahoo! Scale
The past year has seen the advent of various “low-latency” solutions for querying big data (for varying definitions of “big”). Shark, Impala, Presto, and several other systems have been introduced under the same basic premise: that Hive on MapReduce is too slow for interactive queries. The Hive Stinger initiative aimed to tackle these concerns. At Yahoo, we have seen strong demand for low-latency, interactive queries (for example, through BI tools like Tableau and MicroStrategy). But Yahoo scale is non-trivial, with dataset partitions running to several TBs. Rather than introduce a completely new system into the mix, we’d prefer a common solution that provides quick results for queries on small datasets while also scaling to support much larger data sizes, all within the same framework.
The Hive team at Yahoo has spent the past several months benchmarking multiple versions of Hive (and Tez), across different combinations of file formats, compression, and query features (vectorization, index filters, etc.), at various scales of data size. In this talk, we present our tests, the results and findings, and rules of thumb for tuning the OS, Hadoop, Tez, and Hive for optimal performance. We will also touch upon the design decisions that enable both speed and scale.
Mithun Radhakrishnan is a committer on the HCatalog project, and a Hive developer at Yahoo. He’s the author of DistCp on Hadoop 0.23+. He’s an erstwhile firmware developer and is prone to flare-ups from C++ withdrawal.