Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Scaling database and analytic workloads with Apache Kudu

4:35pm–5:15pm Wednesday, September 27, 2017
Data Engineering & Architecture
Location: 1A 21/22 Level: Beginner
Average rating: 4.00 (2 ratings)

Who is this presentation for?

  • Managers, data scientists, and data and system engineers

Prerequisite knowledge

  • Basic knowledge of distributed computing and the Hadoop ecosystem (HDFS, YARN, HBase, Spark, etc.)

What you'll learn

  • Understand Apache Kudu fundamental concepts
  • Learn how to use Apache Kudu for scale-out database-like systems


Coupling online data processing with scalable analytics is a popular technique for systems that produce large amounts of data, such as those at CERN. This has always been difficult to achieve with traditional database systems or the Hadoop ecosystem: although feasible, it involves many compromises and brings extra cost and complexity.

Apache Kudu is a new distributed storage engine that combines low-latency data ingestion, scalable analytics, and fast data lookups. But what does it deliver in practice? Zbigniew Baranowski explains how to use Apache Kudu for scale-out database-like systems, such as those used at CERN for controlling and supervising the accelerator infrastructure and for the particle-collision catalogue, covering its advantages and limitations and measuring its performance.

Topics include:

  • The rationale for choosing Apache Kudu for these systems
  • Advantages and limitations of Apache Kudu based on user experience
  • The performance measured with CERN’s workloads
  • A comparison of Apache Kudu with other popular data formats and storage engines available in the Hadoop ecosystem

Zbigniew Baranowski
Zbigniew Baranowski is a database system specialist and a member of a group that provides central database and Hadoop services at CERN.