Since its initial release in 2012, Apache Impala has been deployed on a wide range of cluster sizes. In recent years, deployments grew to sizes where Impala’s RPC layer, based on Apache Thrift RPC, couldn’t keep up. Because that layer was synchronous and did not multiplex connections, Impala consumed exorbitant amounts of kernel resources, which often led to instability and query failures.
In the past 18 months, Apache Kudu’s RPC framework (KRPC) has been successfully integrated into Impala. Originally developed for the Kudu project, it was built from the ground up to support asynchronous communication between a large number of nodes across multiplexed connections. It also comes with support for TLS and Kerberos.
Lars Volker and Michael Ho discuss Impala’s distributed execution in detail, cover KRPC’s properties, and explain how they integrated KRPC into Impala. Along the way, they demonstrate how KRPC enables Impala to scale beyond its previous limitations. They also touch on how Impala consumes KRPC as a library, so that other projects looking for a scalable RPC implementation can benefit from their experience.
Lars Volker is a software engineer at Cloudera. He has worked on various parts of Apache Impala, including crash handling, its Parquet scanners, and scan range scheduling. Most recently, he worked on integrating Kudu’s RPC framework into Impala. Previously, he worked on various databases at SAP.
Michael Ho is a software engineer at Cloudera. He has worked on various parts of the Apache Impala query execution engine, including reducing codegen time, overhauling expression evaluation, and, most recently, making Impala more scalable. Before Cloudera, Michael built hypervisors and VMMs at VMware.
©2019, O'Reilly Media, Inc.