Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Optimizing Apache Impala for a cloud-based data warehouse

Greg Rahn (Cloudera)
2:55pm–3:35pm Wednesday, 09/12/2018
Big data and data science in the cloud
Location: 1A 10
Level: Intermediate

Who is this presentation for?

  • Enterprise architects, database developers, data engineers, and DBAs

Prerequisite knowledge

  • Basic familiarity with AWS or Azure instances, S3, ADLS, and Apache Impala

What you'll learn

  • Understand the advantages of using cloud object stores for your data warehouse
  • Learn how to optimally choose instance types for SQL workloads and store data in S3 or ADLS
  • Explore performance and tuning recommendations for Apache Impala

Description

Cloud object stores like Amazon’s S3 or Azure’s Data Lake Storage are becoming the bedrock of cloud data warehouses for modern data-driven enterprises. Given today’s data sizes, data teams need to be able to query data directly where it sits in the object store. Apache Impala does just this: it lets users run SQL directly over data in S3, with no data loading required. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse.

Topics include:

  • Choosing optimal instance types
  • Choosing the optimal schema design, file format, and partitioning layout
  • End-to-end workflows for both transient and long-lived data analysis
  • Apache Impala performance optimizations for cloud and supporting benchmarks

Greg Rahn

Cloudera

Greg Rahn is director of product management at Cloudera, where he’s responsible for driving SQL product strategy as part of the company’s data warehouse product team, including working directly with Impala. For over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, which together give him a holistic view of the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.