Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Best practices with Kudu: An end-to-end user case from the automobile industry

Wei Chen (Intel), Zhaojuan Bian (Intel)
4:15pm4:55pm Thursday, December 7, 2017

Who is this presentation for?

  • Software engineers who work on storage engines or databases for big data

Prerequisite knowledge

  • A basic understanding of distributed storage systems, such as HDFS and HBase, columnar storage, and the Hadoop software stack

What you'll learn

  • Learn how to evaluate storage engines for different workloads and best practices for Kudu performance optimization

Description

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

The end-to-end system for streaming data injection and real-time and batch analytics uses Kafka and Spark for the messaging, streaming, and batch jobs. For the storage layer, the customer wanted to evaluate HDFS, HBase, and Kudu solutions for its usage scenarios. Wei and Zhaojuan discuss the challenges they encountered in tuning Kudu performance, largely because it’s a new storage engine, so there isn’t much available information to refer to.

The performance of the Kudu-based cluster varies significantly with different workload setups, hardware selections, and software parameters (OS VM parameters, hashed tablet count, maintenance thread number, etc.). For example, table schema design is critical to the performance of time series injection workloads. Small range partitioning is good to achieve a high injection rate since the number of bloom filter lookups can be reduced. However, it will result in the increase of scanned tablet count for analytic jobs. Different scenarios also require different hardware resources. For injection intensive scenarios, SSDs must be used as WAL disks. Faster, higher core count CPUs are also needed when active tablet count increases. However, after fixing these performance issues, Kudu offers a balanced solution.

Topics include:

  • A comparison of HDFS, HBase, and Kudu for both injection and analytic jobs and why Wei and Zhaojuan selected Kudu
  • Best practices and key indicators on Kudu performance optimization, including table schema design, Kudu parameter tuning, hardware selection (processor, memory, disk and network), and software stack parameter tuning
  • Cluster capacity planning suggestions
Photo of Wei Chen

Wei Chen

Intel

Wei Chen is a software engineer at Intel. He is dedicated to performance optimization and simulation of storage engines for big data. Wei holds a master’s degree in signal and information processing from Nanjing University in China.

Photo of Zhaojuan Bian

Zhaojuan Bian

Intel

Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of experience in the industry with performance analysis experience that spans big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.