Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

How to use Parquet as a basis for ETL and analytics

Julien Le Dem (WeWork)
1:30pm–2:10pm Friday, 02/20/2015
Hadoop Platform
Location: 210 C/G
Average rating: 1.40 (5 ratings)
Slides: PDF

Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration into most Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to adopt in existing ETL and processing pipelines, while leaving the choice of query engine open (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools, such as Spark, Impala, Pig, Hive, and Cascading, to create powerful, efficient data analysis pipelines. Data management is simplified because the format is self-describing and handles schema evolution. Support for nested structures enables more natural data modeling for Hadoop than flat representations, which create the need for often costly joins.
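To make the pattern in the abstract concrete, here is a minimal sketch (not from the talk or its slides) of Parquet as the interchange format between an ETL step and an analytics step. It assumes the modern Spark Dataset API, which postdates this 2015 session, and the case classes, paths, and object name are hypothetical:

    // Hypothetical sketch, not code from the talk. One job writes Parquet
    // (including a nested column); any Parquet-aware engine can read the
    // files back, since the schema is embedded in the file footer.
    import org.apache.spark.sql.SparkSession

    case class Address(city: String, zip: String)
    case class User(id: Long, name: String, address: Address)

    object ParquetPipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-etl-sketch").getOrCreate()
        import spark.implicits._

        // ETL step: write records with a nested `address` group to Parquet,
        // instead of flattening into separate tables that would need a join.
        val users = Seq(
          User(1L, "Ada", Address("London", "EC1V")),
          User(2L, "Grace", Address("Arlington", "22201"))
        ).toDS()
        users.write.parquet("users.parquet")

        // Analytics step: read the same files back; no external metadata is
        // needed. Selecting `name` and `address.city` projects only those
        // columns, which is where the columnar layout pays off.
        val readBack = spark.read.parquet("users.parquet")
        readBack.select($"name", $"address.city").show()

        spark.stop()
      }
    }

Because the files carry their own schema, other engines mentioned in the abstract (Hive, Impala, Pig, …) can query the same directory directly, for example by declaring an external table over it.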


Julien Le Dem

WeWork

Julien Le Dem is a Data Systems Engineer at Twitter. Previously, he was a Principal Engineer at Yahoo. He contributes to a number of Hadoop-related projects, including HCatalog, and is a PMC member of Apache Pig.