SQL on Hadoop: Defining the New Generation of Analytic Databases

Sponsored Sessions Ballroom G

The analytics and data warehousing industries are in the midst of a major period of transformation and upheaval. Since the publication nearly a decade ago of Google’s seminal MapReduce and GFS papers, we have witnessed the appearance of Apache Hadoop, followed closely by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors.

This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, and defined by systems like Citus Data’s CitusDB and Cloudera’s Impala, these new systems share important architectural details that distinguish them from the previous generation of analytic databases.

In this talk we will discuss the unavoidable cost and performance limitations of the connector-based approach employed by many established vendors, and explain the long-term significance of Apache Hive’s data model along with its influence on next-generation SQL-on-Hadoop databases. We will then examine the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.

This session is sponsored by Citus Data

Carl Steinbach


Carl Steinbach is a software engineer at Citus Data, as well as a committer and PMC member on the Apache Hive project. Previously, Carl worked at Cloudera, where he led the Hive team; at NetApp, where he developed storage encryption products; and at Oracle, where he was a member of the Server Technologies group. Carl holds B.S. and M.Eng. degrees in Computer Science from MIT.


Mark Madsen
08/30/2013 9:17am PDT

Good perspective in this talk, although I feel the comment “MPP DBs scale to tens of nodes” shows some ignorance of the MPP database market, given the multi-petabyte installs of MPP databases and their very large node counts.

Carl Steinbach
03/01/2013 7:40am PST

Slides from the presentation are available here: http://www.slideshare.net/OReillyStrata/sql-on-hadoop-defining-the-new-generation-of-analytic-databases

