Presented By O’Reilly and Cloudera

Make Data Work

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Autonomous ETL with materialized views

Adesh Rao (Qubole), Abhishek Somani (Qubole)

12:05–12:45 Thursday, 24 May 2018

Big data and data science in the cloud, Data engineering and architecture
Location: Capital Suite 8/9
Level: Intermediate

Secondary topics: Data Integration and Data Pipelines sessions

Who is this presentation for?

Data engineers, data team admins, and big data DevOps engineers

Prerequisite knowledge

A basic understanding of SQL engines on Hadoop

What you'll learn

Learn about materialized views in big data SQL engines
Explore a framework for the automatic creation, use, invalidation, and refresh of materialized views for faster ad hoc queries

Description

SQL-on-Hadoop engines like Hive, Presto, Impala, Drill, and Spark SQL have made major strides in improving the performance of ad hoc and reporting queries. A large part of that improvement comes from storing data sorted, bucketed, or partitioned on key columns. However, experience shows that these techniques are often not applied appropriately because of their high operational overhead. As a result, users have to live with slow queries or with hard-to-manage operational issues such as a very large number of partitions.
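To make the trade-off concrete, here is a minimal Python sketch of partitioning on a key column. It is a toy in-memory model, not how any of the engines named above store data; the helper names (`partition_by`, `scan_unpartitioned`, `scan_partitioned`) and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of one column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def scan_unpartitioned(rows, key, value):
    """Filter by reading every row; returns (matches, rows_read)."""
    matches = [r for r in rows if r[key] == value]
    return matches, len(rows)


def scan_partitioned(parts, value):
    """Filter by reading only the matching partition; returns (matches, rows_read)."""
    matches = list(parts.get(value, []))
    return matches, len(matches)


# Hypothetical sample data: nine events spread evenly over three days.
rows = [{"day": f"2018-05-{22 + i % 3}", "event_id": i} for i in range(9)]

by_day = partition_by(rows, "day")            # a well-chosen partition key
by_event = partition_by(rows, "event_id")     # a poorly chosen, unique key

full_matches, full_read = scan_unpartitioned(rows, "day", "2018-05-24")
pruned_matches, pruned_read = scan_partitioned(by_day, "2018-05-24")
```

In this toy data set the partitioned scan reads 3 rows instead of 9 for the same result, while partitioning on the unique `event_id` column produces one partition per row, which is the "very large number of partitions" problem in miniature.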

Qubole uses materialized views in Apache Hive to provide autonomous ETL, letting data engineering teams restructure data into the format and layout their workloads need. Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views built on top of source data, for optimal performance and strict correctness. Adesh and Abhishek first make the case for materialized views as the foundation for autonomous ETL. They then cover the challenges materialized views pose and how the framework addresses them: creating and using materialized views, automatically detecting changes to source tables and invalidating the affected views, and automatically running full or partial refreshes after invalidation. Although Qubole uses these techniques with Apache Hive and Presto, they are implemented in an engine-agnostic fashion so that engines such as Spark SQL can use them as well.
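The lifecycle described above (create, use, invalidate on source change, then refresh fully or partially) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the speakers' framework or any engine's API: the classes `SourceTable` and `MaterializedView`, the per-partition version counters used for change detection, and the `SUM(amount)` aggregate are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTable:
    """Toy partitioned table: partition key -> list of (user, amount) rows."""
    partitions: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)  # partition key -> write counter

    def insert(self, partition, row):
        self.partitions.setdefault(partition, []).append(row)
        # Assumption: every write bumps a version so views can detect changes.
        self.versions[partition] = self.versions.get(partition, 0) + 1


@dataclass
class MaterializedView:
    """Precomputed SUM(amount) per partition, plus the source versions it saw."""
    source: SourceTable
    totals: dict = field(default_factory=dict)
    built_from: dict = field(default_factory=dict)

    def stale_partitions(self):
        # A partition is stale when the source version differs from the one we built from.
        return sorted(p for p, v in self.source.versions.items()
                      if self.built_from.get(p) != v)

    def is_valid(self):
        return not self.stale_partitions()

    def refresh(self, full=False):
        """Recompute every partition (full) or only the stale ones (partial)."""
        targets = sorted(self.source.partitions) if full else self.stale_partitions()
        for p in targets:
            self.totals[p] = sum(amount for _, amount in self.source.partitions[p])
            self.built_from[p] = self.source.versions[p]
        return targets

    def query_total(self, partition):
        # Strict correctness: never answer from an invalidated view.
        if not self.is_valid():
            self.refresh()
        return self.totals.get(partition, 0)


table = SourceTable()
table.insert("2018-05-23", ("alice", 10))
table.insert("2018-05-24", ("bob", 5))

view = MaterializedView(table)
view.refresh(full=True)                       # initial build
table.insert("2018-05-24", ("carol", 7))      # source change invalidates the view
stale = view.stale_partitions()
refreshed = view.refresh()                    # partial refresh of stale partitions only
```

After the last insert only the `2018-05-24` partition is stale, so the partial refresh recomputes that one partition and leaves `2018-05-23` untouched. A real engine would detect changes from table metadata rather than an in-process counter; the counter here simply stands in for that signal.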

Adesh Rao

Qubole

Adesh Rao is a member of the technical staff on the Hive team at Qubole. He holds a degree from BITS Pilani.

Abhishek Somani

Qubole

Abhishek Somani is a senior staff engineer on the Hive team at Qubole. Previously, Abhishek worked at Citrix and Cisco. He holds a degree from NIT Allahabad.
