Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Scaling Impala: Common mistakes and best practices

Manish Maheshwari (Cloudera)
11:15–11:55 Thursday, 2 May 2019

Who is this presentation for?

  • Administrators, data engineers working with Impala, and developers of BI tools on top of Impala

Level

Intermediate

Prerequisite knowledge

  • A basic understanding of a modern MPP database (Impala or similar)

What you'll learn

  • Learn how to optimally set up and configure Impala for large-scale deployments
  • Explore methods to ensure consistent performance at scale, get started with query profiles, and identify bottlenecks

Description

Apache Impala is a complex engine and requires a thorough technical understanding to utilize it fully. Without proper configuration or usage, Impala’s performance becomes unpredictable, and end-user experience suffers. However, for many users and administrators, the right configuration of Impala is still a mystery.

Drawing on work with some of the largest clusters in the world, Manish Maheshwari shares ingestion best practices to keep an Impala deployment scalable and details admission control configuration to provide a consistent experience to end users. Manish also takes a high-level look at Impala’s query profile, which is used as a first step in any performance troubleshooting, and discusses common mistakes users and BI tools make when interacting with Impala. Manish concludes by detailing an ideal setup to show all of this in practice.

Manish Maheshwari

Cloudera

Manish Maheshwari is a data architect and data scientist at Cloudera. Manish has 13+ years of experience building extremely large data warehouses and analytical solutions. He’s worked extensively on Hadoop, DI and BI tools, data mining and forecasting, data modeling, master data and metadata management, and dashboard tools, and is proficient in Hadoop, SAS, R, Informatica, Teradata, and QlikView. He participates in Kaggle data mining competitions as a hobby.