Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based end-to-end (E2E) system. They also discuss the key indicators to watch when tuning Kudu and OS parameters and how to select the best hardware components for different scenarios.
The end-to-end system for streaming data injection and for real-time and batch analytics uses Kafka and Spark for the messaging, streaming, and batch jobs. For the storage layer, the customer wanted to evaluate HDFS, HBase, and Kudu solutions for its usage scenarios. Wei and Zhaojuan discuss the challenges they encountered in tuning Kudu performance, largely because it is a new storage engine with little available information to refer to.
The performance of a Kudu-based cluster varies significantly with workload setup, hardware selection, and software parameters (OS VM parameters, hashed tablet count, number of maintenance threads, etc.). For example, table schema design is critical to the performance of time series injection workloads. Small range partitions help achieve a high injection rate because they reduce the number of bloom filter lookups. However, they also increase the number of tablets that analytic jobs must scan. Different scenarios also require different hardware resources. For injection-intensive scenarios, SSDs must be used for the write-ahead log (WAL) disks. Faster CPUs with higher core counts are also needed as the active tablet count increases. Once these performance issues are addressed, however, Kudu offers a balanced solution.
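The scan-side cost of small range partitions can be illustrated with a little arithmetic. The sketch below is a simplified model, not something taken from the talk: it assumes a table that is range partitioned on a timestamp (in whole days, aligned at day 0) and hash partitioned on another key column, and that an analytic scan filters on time only, so it touches every hash bucket of each range partition it overlaps. The function names and the example numbers (90 days of retention, 4 hash buckets, a 30-day scan) are hypothetical.

```python
import math


def total_tablets(retention_days, range_width_days, hash_buckets):
    """Tablets in the table: one per (range partition, hash bucket) pair."""
    range_partitions = math.ceil(retention_days / range_width_days)
    return range_partitions * hash_buckets


def scanned_tablets(start_day, end_day, range_width_days, hash_buckets):
    """Tablets touched by a scan over the half-open interval [start_day, end_day).

    Assumes range partitions are aligned at multiples of range_width_days
    and that the scan has no predicate on the hash-partitioned column.
    """
    first = start_day // range_width_days
    last = (end_day - 1) // range_width_days
    return (last - first + 1) * hash_buckets


if __name__ == "__main__":
    # Hypothetical setup: 90 days retained, 4 hash buckets, 30-day scan.
    for width in (1, 7, 30):
        print(
            f"range width {width:>2} days: "
            f"{total_tablets(90, width, 4)} tablets total, "
            f"{scanned_tablets(0, 30, width, 4)} scanned"
        )
```

Under these assumptions, narrowing the range partitions multiplies the number of tablets a fixed time-window scan has to open. The model leaves out the write-side benefit described above (fewer bloom filter lookups), which would have to be measured on a real cluster.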
Wei Chen is a software engineer at Intel. He is dedicated to performance optimization and simulation of storage engines for big data. Wei holds a master’s degree in signal and information processing from Nanjing University in China.
Zhaojuan (Bianny) Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of industry experience, with performance analysis work spanning big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.