SQL-on-Hadoop engines like Hive, Presto, Impala, Drill, and Spark SQL have made major strides in improving the performance of ad hoc and reporting queries. A large part of that improvement comes from storing the data sorted, bucketed, or partitioned on key columns. However, experience shows that these techniques are often not applied appropriately because of their high operational overhead. As a result, users have to put up with either slow queries or unmanageable operational issues, such as a very large number of partitions.
Qubole uses materialized views in Apache Hive to provide autonomous ETL, enabling data engineering teams to restructure data into the right format and layout for their workloads. Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views built on top of the data, for optimal performance and strict correctness. Adesh and Abhishek first make the case for materialized views as the foundation for autonomous ETL. They then discuss the challenges that materialized views pose and how the framework addresses them: creating and using materialized views, automatically detecting changes to source tables and invalidating the affected views, and automatically running full or partial refreshes after invalidation. Although Qubole uses these techniques with Apache Hive and Presto, they have been implemented in an engine-agnostic fashion so that engines such as Spark SQL can use them as well.
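To make the invalidate-and-refresh cycle concrete, here is a minimal in-memory sketch in Python. It is not Qubole's framework or Hive's implementation. The class names (SourceTable, MaterializedView), the per-partition version counter used for change detection, and the SUM ... GROUP BY view definition are all assumptions chosen for illustration.

```python
from collections import defaultdict


class SourceTable:
    """Hypothetical partitioned table: partition key -> list of (group, amount) rows."""

    def __init__(self):
        self.partitions = defaultdict(list)
        self.versions = defaultdict(int)  # bumped on every write to a partition

    def insert(self, partition, group, amount):
        self.partitions[partition].append((group, amount))
        self.versions[partition] += 1


class MaterializedView:
    """Hypothetical view holding SUM(amount) GROUP BY group, stored per source partition."""

    def __init__(self, source):
        self.source = source
        self.data = {}           # partition -> {group: total}
        self.seen_versions = {}  # partition -> source version at last refresh

    def stale_partitions(self):
        # Change detection: a partition is stale when its current source
        # version differs from the version recorded at the last refresh.
        return sorted(
            p for p, v in self.source.versions.items()
            if self.seen_versions.get(p) != v
        )

    def _rebuild(self, partition):
        totals = defaultdict(int)
        for group, amount in self.source.partitions[partition]:
            totals[group] += amount
        self.data[partition] = dict(totals)
        self.seen_versions[partition] = self.source.versions[partition]

    def refresh(self, full=False):
        # A full refresh recomputes every partition; a partial refresh
        # recomputes only the stale ones. Returns the partitions rebuilt.
        targets = sorted(self.source.versions) if full else self.stale_partitions()
        for partition in targets:
            self._rebuild(partition)
        return targets

    def query(self):
        # Strict correctness: never answer from an invalidated view.
        if self.stale_partitions():
            raise RuntimeError("view is invalidated; refresh before use")
        result = defaultdict(int)
        for totals in self.data.values():
            for group, total in totals.items():
                result[group] += total
        return dict(result)


orders = SourceTable()
orders.insert("2018-01-01", "books", 10)
orders.insert("2018-01-02", "books", 5)

view = MaterializedView(orders)
view.refresh(full=True)

orders.insert("2018-01-02", "games", 7)  # a write invalidates one partition
print(view.stale_partitions())           # ['2018-01-02']
print(view.refresh())                    # partial refresh: ['2018-01-02']
print(view.query())                      # {'books': 15, 'games': 7}
```

Tracking staleness per partition is what makes a partial refresh possible in this sketch: only the partition that changed is recomputed, while the rest of the view is reused as is.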
Adesh Rao is a member of the technical staff on the Hive team at Qubole. He holds a degree from BITS Pilani.
Abhishek Somani is a senior staff engineer on the Hive team at Qubole. Previously, Abhishek worked at Citrix and Cisco. He holds a degree from NIT Allahabad.