As Uber continues to grow, its big data systems must also grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. Zhenxiao Luo shares his experience running columnar storage in production at Uber and discusses query optimization techniques in SQL engines.
Uber’s Hadoop warehouse uses columnar storage with Parquet as the default file format, Presto as its interactive query engine, and Hive and Spark as its batch engines. Zhenxiao explains how Uber developed performance optimizations for columnar storage in all of these query engines, including nested column pruning, predicate pushdown, dictionary pushdown, columnar reads, and lazy reads, which together yield a more than 5x performance improvement in all query engines.
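To make a few of these terms concrete, here is a minimal sketch in plain Python of column pruning, predicate pushdown, and lazy reads over a toy columnar layout. It is an illustration only, not Uber's implementation and not the Parquet or Presto API: the `RowGroup` class, the `scan` function, and the sample trip data are all hypothetical, and the filter is assumed to be a simple range on one column.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical row group: one list of values per column, plus min/max stats."""
    columns: dict
    stats: dict = field(init=False)

    def __post_init__(self):
        # Per-column (min, max) statistics, similar in spirit to file-format metadata.
        self.stats = {name: (min(vals), max(vals)) for name, vals in self.columns.items()}


def scan(row_groups, select, filter_col, lo, hi):
    """Evaluate SELECT <select> WHERE lo <= filter_col <= hi.

    Returns (rows, values_read), where values_read counts column values touched.
    """
    rows, values_read = [], 0
    for rg in row_groups:
        gmin, gmax = rg.stats[filter_col]
        # Predicate pushdown: skip the whole row group when its stats rule out a match.
        if gmax < lo or gmin > hi:
            continue
        # Read only the filter column first.
        filter_vals = rg.columns[filter_col]
        values_read += len(filter_vals)
        matches = [i for i, v in enumerate(filter_vals) if lo <= v <= hi]
        # Column pruning and lazy reads: touch only the selected columns,
        # and only at the positions that passed the filter.
        for i in matches:
            row = {}
            for col in select:
                row[col] = rg.columns[col][i]
                if col != filter_col:
                    values_read += 1
            rows.append(row)
    return rows, values_read


# Made-up sample data: two row groups of three trips each (18 values in total).
row_groups = [
    RowGroup({"trip_id": [1, 2, 3], "fare": [5.0, 7.5, 6.0], "city": ["sf", "sf", "nyc"]}),
    RowGroup({"trip_id": [4, 5, 6], "fare": [20.0, 35.0, 22.0], "city": ["nyc", "sf", "la"]}),
]

rows, values_read = scan(row_groups, select=["trip_id"], filter_col="fare", lo=21.0, hi=40.0)
print(rows, values_read)
```

In this toy run the first row group is skipped from its statistics alone, the `city` column is never read, and `trip_id` is read only for the two matching rows, so 5 of the 18 stored values are touched.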
Zhenxiao Luo leads the Interactive Query Engines team at Twitter, where he focuses on Druid, Presto, Spark, and Hive. Before joining Twitter, Zhenxiao ran the Interactive Analytics team at Uber. He has big data experience at Netflix, Facebook, Cloudera, and Vertica. Zhenxiao is a committer and Technical Steering Committee (TSC) member of Presto. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.
©2017, O'Reilly Media, Inc.