Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Hadoop's storage gap: Resolving transactional-access and analytic-performance tradeoffs with Apache Kudu (incubating)

Todd Lipcon (Cloudera)
1:50pm–2:30pm Wednesday, 03/30/2016
Tags: real-time
Average rating: 4.68 (19 ratings)

Prerequisite knowledge

Attendees should be familiar with Hadoop storage alternatives.

Description

Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap with traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily sized datasets.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data-warehousing workloads.
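
To make the tradeoff concrete, here is a minimal in-memory sketch of the two layouts. It is only a toy illustration under stated assumptions: the sample data, the names row_store and col_store, and the helper functions are all hypothetical, and nothing here reflects how Parquet, HBase, or Kudu are actually implemented. It shows why a column layout suits single-column scans (one contiguous array is read) while a row layout suits point lookups (one record holds every field).

```python
from bisect import bisect_left

# Hypothetical sample data: (key, metric, payload) records, sorted by key.
records = [(i, i * 2, f"payload-{i}") for i in range(1000)]

# Row-oriented layout: all fields of a record are stored together.
row_store = list(records)
row_keys = [r[0] for r in row_store]  # sorted key index for lookups

# Column-oriented layout: one array per column, aligned by position.
col_store = {
    "key": [r[0] for r in records],
    "metric": [r[1] for r in records],
    "payload": [r[2] for r in records],
}


def scan_sum_rows(store):
    """Analytic scan over a row layout: every whole record is visited
    even though only one field is needed."""
    return sum(record[1] for record in store)


def scan_sum_columns(store):
    """Analytic scan over a column layout: only the 'metric' array is read."""
    return sum(store["metric"])


def lookup_row(store, keys, key):
    """Point lookup in a row layout: one index probe, one record fetch."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return store[i]
    return None


def lookup_columns(store, key):
    """Point lookup in a column layout: one index probe, then one fetch
    per column to reassemble the record."""
    i = bisect_left(store["key"], key)
    if i < len(store["key"]) and store["key"][i] == key:
        return tuple(store[name][i] for name in ("key", "metric", "payload"))
    return None
```

Both layouts return the same answers; the difference is how much data each operation has to touch. On disk, that difference is what separates a scan-optimized format from a random-access-optimized one.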

Todd Lipcon explores the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage-engine internals. Todd also outlines Kudu, the new addition to the open source Hadoop ecosystem that complements HDFS and HBase to provide a new option for achieving fast scans and fast random access from a single API.

Todd Lipcon

Cloudera

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Comments

Todd Lipcon
02/29/2016 12:05pm PST

Hi Cristofer. You’re right: most of the presentation will be similar to the one given at last year’s Strata, since Kudu is still relatively new and we expect that most people attending this West Coast Strata probably didn’t see the introductory talk in NYC last fall. However, we’ll of course update the talk to discuss the current state of the project up through the latest releases.

I would say, though, that if you already attended the talk in NYC, you’d probably find it a better use of your time at the conference to attend a different talk :)

-Todd

02/29/2016 10:11am PST

Is this presentation different from last Strata NY? Title and description look pretty much the same.