Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Autonomous ETL with materialized views

Adesh Rao (Qubole), Abhishek Somani (Qubole)
12:0512:45 Thursday, 24 May 2018
Big data and data science in the cloud, Data engineering and architecture
Location: Capital Suite 8/9 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions
Average rating: ***..
(3.00, 2 ratings)

Who is this presentation for?

  • Data engineers, data team admins, and big data DevOps engineers

Prerequisite knowledge

  • A basic understanding of SQL engines on Hadoop

What you'll learn

  • Learn about materialized views in big data SQL engines
  • Explore a framework for automatic creation, use, invalidating, and refreshes of materialized views for faster ad hoc queries

Description

SQL-on-Hadoop engines like Hive, Presto, Impala, Drill, and Spark SQL have made major strides in improving the performance of ad hoc and reporting queries. A big component of the performance improvement is to store the data sorted, bucketed, or partitioned on key columns. However, experience shows that these techniques are not used appropriately because of high operational overheads. Therefore, users have to manage with slow query times or unmanageable operational issues like very large number of partitions.

Qubole uses materialized views in Apache Hive to provide autonomous ETL, enabling data engineering teams to restructure the data in the right format and structure based on their workloads. Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness. Adesh and Abhishek first make a case for materialized views as the foundation for autonomous ETL to restructure data and then address challenges with materialized views and how these can be addressed within the framework, particularly for the creation and use of materialized views, automatic detection of changes to source tables and consequent invalidation of related materialized views, and automatic full and partial refreshes of materialized views on invalidation. Although Qubole uses these techniques with Apache Hive and Apache Presto, they have been implemented in an engine-agnostic fashion so that engines such as Spark SQL can utilize them as well.

Photo of Adesh Rao

Adesh Rao

Qubole

Adesh Rao is a member of the technical staff on the Hive team at Qubole. He holds a degree from BITS Pilani.

Photo of Abhishek Somani

Abhishek Somani

Qubole

Abhishek Somani is a Staff Software engineer on the Hive team at Qubole. Previously, Abhishek worked at Citrix and Cisco. He holds a degree from NIT Allahabad.