Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Creating real-time, data-centric applications with Impala and Kudu

Todd Lipcon (Cloudera), Marcel Kornacker (Cloudera)
5:05pm–5:45pm Wednesday, December 7, 2016
Data science and advanced analytics
Location: Summit 2 Level: Intermediate
Average rating: ****.
(4.17, 6 ratings)

Prerequisite Knowledge

  • A basic understanding of SQL and the Hadoop ecosystem

What you'll learn

  • Explore a modern big data architecture for real-time, data-centric applications


Running real-time data-intensive applications on Apache Hadoop requires complex architectures to store and query data, typically involving multiple independent systems that are tied together through custom-engineered pipelines. A common pattern is to use a NoSQL engine like Apache HBase for caching and later transformations, the results of which are periodically written to HDFS in one of the popular open columnar file formats as a prerequisite for querying by a SQL engine.

Apache Kudu (incubating), a new scalable distributed storage engine designed for the Hadoop environment, gives the user low-latency single-row access as well as high-throughput bulk data scans. Integrated with Apache Impala (incubating), these capabilities are made available to the user via standard SQL language elements for updates and querying, combining the flexible update functionality of an RDBMS with the performance of a parallel analytic database system.

Todd Lipcon and Marcel Kornacker provide an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.

Photo of Todd Lipcon

Todd Lipcon


Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Photo of Marcel Kornacker

Marcel Kornacker


Marcel Kornacker is a tech lead at Cloudera and the architect of Apache Impala (incubating). Marcel has held engineering jobs at a few database-related startup companies and at Google, where he worked on several ad-serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project. Marcel holds a PhD in databases from UC Berkeley.