Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Format wars: From VHS and Beta to Avro and Parquet

Silvia Oliveros (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
1:50pm–2:30pm Thursday, 03/31/2016
Average rating: 3.58 (12 ratings)

Prerequisite knowledge

Attendees should be familiar with the Hadoop ecosystem.


Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance.

Silvia Oliveros and Stephen O’Sullivan cover the four major data formats (plain text, SequenceFile, Avro, and Parquet) and provide insight into what they are and how to best use and store them in HDFS. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, Silvia and Stephen have observed performance differences on the order of 25x between Parquet and plain text files for certain workloads. However, it isn’t the case that one is always better than the others.

Drawing from a few real-world use cases, Silvia and Stephen cover the hows, whys, and whens of choosing one format over another and take a closer look at some of the tradeoffs each offers.

Silvia Oliveros

Silicon Valley Data Science

Silvia Oliveros is a data engineer at Silicon Valley Data Science, where she helps clients explore and analyze their data. Silvia has a background in computer engineering and visual analytics and is interested in building and optimizing the infrastructure and data pipelines used to gather insights from various datasets.

Stephen O'Sullivan

Data Whisperers

A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and application solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.

Comments on this page are now closed.


Phillip Radley
04/12/2016 7:13pm PDT

Pls could you click on the link and upload your slides. It only takes a minute.