Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Building real-time BI systems with HDFS and Kudu

Ruhollah Farchtchi (Zoomdata)
12:05–12:45 Thursday, 2/06/2016
Hadoop internals & development
Location: Capital Suite 15/16 Level: Intermediate
Average rating: ***..
(3.78, 9 ratings)

Prerequisite knowledge

Attendees should have a high-level familiarity with the main Hadoop ecosystem projects (Parquet, Avro, Impala, etc.).

Description

One of the key challenges in working with real-time and streaming data is that the data format for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization service that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools that make it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over large number of similar rows.

Ruhollah Farchtchi explores best practices for dealing with these challenges and the append-only nature of HDFS and discusses how to make sure data is distributed appropriately. This is challenging to do with static data and even tougher with real-time, dynamic data. Ruhollah also explains how to deal with updates to existing data, whether due to restatements or a need to compact the data.

Ruhollah then offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for fast analytics on rapidly changing data, demonstrates how Kudu simplifies the architecture of such systems, and reviews a number of lessons learned from working with Kudu, including how to use dictionary attributes to optimize storage of denormalized dimensional data; how to achieve a high degree of parallelization of queries via data distribution and sizing the right number of tablets based on available cores; and how to balance insert rates versus read-heavy workloads.

Photo of Ruhollah Farchtchi

Ruhollah Farchtchi

Zoomdata

Ruhollah Farchtchi is chief technologist and vice president of Zoomdata Labs. Ruhollah has over 15 years’ experience in enterprise data management architecture and systems integration. Prior to Zoomdata, he held management positions at BearingPoint, Booz-Allen, and Unisys. Ruhollah holds an MS in information technology from George Mason University.