Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Kudu

Todd Lipcon (Cloudera)
2:05pm–2:45pm Wednesday, 09/30/2015
Hadoop Internals & Development
Location: 1 E16 / 1 E17 Level: Intermediate
Average rating: 3.44 (18 ratings)

Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large-scale data warehousing workloads.

This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.

Todd Lipcon


Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Comments on this page are now closed.


Todd Lipcon
10/05/2015 10:05am EDT

Hey Marcio. I’ll answer your two questions separately:

1) The in-memory row store is used for newly inserted or updated data. It’s not a separate store used for OLTP — more like a write buffer which is later flushed. If you perform updates or deletes on data which was inserted long ago, those rows are not “moved back” into the row store – rather, the updates/deletes themselves are managed in an in-memory delta store. Sorry if this section of the talk was confusing — I tried my best to condense the discussion of internals into the confines of 40 minutes while still leaving time for the overview and use cases.

2) It’s correct that Kudu isn’t built on top of HDFS, but it integrates tightly into the Hadoop ecosystem. For example, we can use Spark, Impala, or MapReduce to read/write data in Kudu, and you can transparently access data across HDFS and Kudu in JOIN queries, etc. Almost all of our testing of Kudu has been in the context of Hadoop clusters (e.g., running MR jobs on YARN to check correctness properties or using Impala to drive TPC-H workloads).

Hope that helps
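
Editor's illustration of point 1 above: a minimal sketch, not Kudu's actual code or API. It assumes a toy tablet in which new rows land in an in-memory row buffer, a flush moves them into per-column lists, and later updates or deletes of flushed rows are recorded in a separate in-memory delta map instead of moving the row back. All names here (`ToyTablet`, `mem_rows`, `deltas`, and so on) are hypothetical.

```python
_DELETED = object()  # sentinel: the flushed row has been deleted


class ToyTablet:
    """Toy model of the write path described above (hypothetical names)."""

    def __init__(self, schema):
        self.schema = list(schema)
        self.mem_rows = {}       # key -> row dict: in-memory write buffer
        self.columns = {col: [] for col in self.schema}  # flushed columnar data
        self.flushed_index = {}  # key -> position in the column lists
        self.deltas = {}         # key -> pending column changes, or _DELETED

    def _flushed_live(self, key):
        return key in self.flushed_index and self.deltas.get(key) is not _DELETED

    def insert(self, key, row):
        # Simplification: a key can never be reused, even after a delete.
        if key in self.mem_rows or key in self.flushed_index:
            raise KeyError(f"key {key!r} already exists")
        self.mem_rows[key] = {col: row[col] for col in self.schema}

    def update(self, key, changes):
        if key in self.mem_rows:
            # Row is still in the write buffer: change it in place.
            self.mem_rows[key].update(changes)
        elif self._flushed_live(key):
            # Row was flushed earlier: record a delta, leave the columns alone.
            self.deltas.setdefault(key, {}).update(changes)
        else:
            raise KeyError(f"key {key!r} not found")

    def delete(self, key):
        if key in self.mem_rows:
            del self.mem_rows[key]
        elif self._flushed_live(key):
            self.deltas[key] = _DELETED
        else:
            raise KeyError(f"key {key!r} not found")

    def flush(self):
        # Move buffered rows into the column lists, in key order.
        for key in sorted(self.mem_rows):
            row = self.mem_rows[key]
            self.flushed_index[key] = len(self.flushed_index)
            for col in self.schema:
                self.columns[col].append(row[col])
        self.mem_rows.clear()

    def read(self, key):
        if key in self.mem_rows:
            return dict(self.mem_rows[key])
        if not self._flushed_live(key):
            return None
        pos = self.flushed_index[key]
        row = {col: self.columns[col][pos] for col in self.schema}
        row.update(self.deltas.get(key, {}))  # apply pending deltas on read
        return row
```

In this sketch, updating a row after `flush()` leaves the column lists unchanged and only adds an entry to `deltas`; `read()` merges the two. A real engine would also persist and compact deltas, which is omitted here.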

Ben Lorica
10/02/2015 3:34pm EDT

Video is available here:

Clint Blacker
10/02/2015 11:06am EDT

Will you provide the slides and/or video to those of us who were not able to get into the room? This was a very popular session!

Marcio Moura
09/30/2015 1:13pm EDT

Hi Todd,
I have a question about Kudu…
I understood that Kudu has an in-memory row store for OLTP transactions and a disk-based columnar store for OLAP. Also, the storage engine has nothing to do with Hadoop/HDFS. Can you confirm that my assumptions are correct?