Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala

Lars Volker (Cloudera), Michael Ho (Cloudera)
11:50am-12:30pm Wednesday, March 27, 2019
Average rating: 4.50 (6 ratings)

Who is this presentation for?

  • Cluster and system admins and distributed software engineers

Level

Advanced

Prerequisite knowledge

  • Basic knowledge of (distributed) system primitives, threads, processes, and synchronous and asynchronous remote procedure calls

What you'll learn

  • Learn how Impala has recently been improved to scale out better
  • Understand why asynchronous, multiplexed, feature-rich RPC frameworks are a key enabler of medium to large-scale applications
  • See how choosing the right RPC framework can improve scalability by an order of magnitude
  • Understand why replacing the RPC framework in an existing application is difficult but possible

Description

Since its initial release in 2012, Apache Impala has been deployed on a wide range of cluster sizes. In recent years, deployments grew to sizes where Impala’s RPC layer—based on Apache Thrift RPC—couldn’t keep up. Its synchronous nature and lack of connection multiplexing made Impala consume exorbitant amounts of kernel resources, often leading to instabilities and query failures.

In the past 18 months, Apache Kudu’s RPC framework (KRPC) has been successfully integrated into Impala. Originally developed for the Kudu project, it was built from the ground up to support asynchronous communication between a large number of nodes across multiplexed connections. It also comes with support for TLS and Kerberos.

Lars Volker and Michael Ho discuss Impala’s distributed execution in detail, cover KRPC’s properties, and explain how they integrated KRPC into Impala. Along the way, they demonstrate how KRPC enables Impala to scale beyond its previous limitations and describe how Impala consumes KRPC as a library, so that other projects looking for a scalable RPC implementation can benefit from their experience.

Lars Volker

Cloudera

Lars Volker is a software engineer at Cloudera. He has worked on various parts of Apache Impala, including crash handling, its Parquet scanners, and scan range scheduling. Most recently, he worked on integrating Kudu’s RPC framework into Impala. Previously, he worked on various databases at SAP.

Michael Ho

Cloudera

Michael Ho is a software engineer at Cloudera. He has worked on various parts of the Apache Impala query execution engine, including reducing codegen time, overhauling expression evaluation, and, most recently, making Impala more scalable. Before joining Cloudera, Michael built hypervisors and VMMs at VMware.