Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Friction-Free ETL: Automating Data Transformation with Impala

Marcel Kornacker (Cloudera)
11:30am–12:10pm Friday, 02/20/2015
Hadoop in Action
Location: 210 B/F

As data is ingested into Apache Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible — because “data freshness” can have a direct impact on business results.

In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, “at-source” analytics — for example, the “ETL lag” reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain and reduces the speed at which new analytic applications can be introduced.
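As a rough illustration of the "flattening" step mentioned above, the following minimal Python sketch turns one nested JSON record into flat rows that fit a relational target schema. It is not Impala's implementation or anything shown in the talk; the record layout, the field names, and the helpers `flatten` and `explode` are all hypothetical, and the list elements are assumed to be JSON objects.

```python
import json


def flatten(record, prefix=""):
    """Collapse nested dicts into one level, joining key paths with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def explode(record, list_field):
    """Emit one flat row per element of a nested list field.

    Assumes every element of the list is itself a dict (hypothetical layout).
    """
    parent = {k: v for k, v in record.items() if k != list_field}
    rows = []
    for item in record.get(list_field, []):
        row = flatten(parent)
        row.update(flatten(item, prefix=f"{list_field}."))
        rows.append(row)
    return rows


# Hypothetical raw input record, as it might arrive from a source system.
raw = json.loads(
    '{"order_id": 7,'
    ' "customer": {"id": 42, "city": "San Jose"},'
    ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}'
)

rows = explode(raw, "items")
for row in rows:
    print(row)
```

Each nested item becomes its own row with the parent fields repeated, which is the kind of up-front reshaping that adds to the ETL lag and that direct querying of nested structures is meant to avoid.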

In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which, in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics, including direct querying of your input schema, immediate querying of data as it lands in HDFS, and performance on par with specialized engines. This performance level is attained even with the most challenging and diverse input formats, which are handled through an automated background conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.

Specifically, attendees will learn about Impala’s upcoming features that will enable at-source analytics: support for nested structures such as JSON and XML documents, which allows direct querying of the source schema; automated background file format conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem; and automated creation of declaratively specified derived data for simplified data cleansing and transformation.


Marcel Kornacker


Marcel Kornacker is the architect and tech lead for Impala at Cloudera. Before joining Cloudera, Marcel worked at Google on several ads serving and storage infrastructure projects and eventually became the tech lead for the distributed query engine component of Google’s F1 project. He holds a PhD in databases from UC Berkeley.



Miguel Lucero
02/23/2015 6:17am PST

Hi, I was wondering if you can make your slides available. This was a great session. Thanks!

Arthur Yeo
02/20/2015 7:37am PST

Where’s the location of your slides?