Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Optimizing Apache Impala for a cloud-based data warehouse

Greg Rahn (Cloudera)
2:55pm–3:35pm Wednesday, 09/12/2018
Big data and data science in the cloud
Location: 1A 10
Level: Intermediate

Who is this presentation for?

  • Enterprise architects, database developers, data engineers, and DBAs

Prerequisite knowledge

  • Basic familiarity with AWS or Azure instances, S3, ADLS, and Apache Impala

What you'll learn

  • Understand the advantages of using cloud object stores for your data warehouse
  • Learn how to choose the best instance types for SQL workloads and how to store data in S3 or ADLS
  • Explore performance and tuning recommendations for Apache Impala


Cloud object stores like Amazon’s S3 or Azure’s Data Lake Storage are becoming the bedrock of cloud data warehouses for modern data-driven enterprises. Given today’s data sizes, it has become a necessity for data teams to have the ability to directly query data stored in the object store. Apache Impala does just this. It lets users run SQL directly over data in S3—no data loading required. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse.
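To make "no data loading required" concrete, here is a minimal sketch of how a table can be declared over files that already sit in S3. It only builds the SQL text; it does not connect to a cluster. The helper name `external_table_ddl`, the `sales` table, its columns, the `example-bucket` path, and the choice of Parquet are all illustrative assumptions, not details from the session.

```python
def external_table_ddl(table, columns, location, file_format="PARQUET"):
    """Build an Impala CREATE EXTERNAL TABLE statement over an object-store path.

    Hypothetical helper for illustration; it returns SQL text and runs nothing.
    """
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


# Assumed example schema and bucket; the data files stay where they are in S3.
ddl = external_table_ddl(
    "sales",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)"), ("sold_at", "TIMESTAMP")],
    "s3a://example-bucket/warehouse/sales/",
)

# Once the table exists, it can be queried like any other table.
query = "SELECT COUNT(*), SUM(amount) FROM sales"

print(ddl)
print(query)
```

The `EXTERNAL` keyword and the `LOCATION` clause are what tie the table to existing files: the statement registers metadata only, so queries read the objects in place rather than from a loaded copy.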

Topics include:

  • Choosing optimal instance types
  • Choosing the optimal schema design, file format, and partitioning layout
  • End-to-end workflows for both transient and long-lived data analysis
  • Apache Impala performance optimizations for cloud and supporting benchmarks

Greg Rahn


Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, giving him a holistic view of the database market and expertise in it. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.
