Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

File format benchmark: Avro, JSON, ORC, and Parquet

Owen O'Malley (Cloudera)
11:20am–12:00pm Wednesday, 09/28/2016
Data innovations
Location: 3D 10
Level: Intermediate
Average rating: 4.92 (12 ratings)

Prerequisite knowledge

  • An understanding of your organization's data use cases and how you expect to access the data

What you'll learn

  • Learn how to evaluate the various factors that influence the choice of the best format for your data without having to do a bake-off yourself

Description

    The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Picking the best data format depends on what kind of data you have and how you plan to use it, and the same format can perform very differently from one use case to the next. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.

    Use cases include:

    • Reading all of the columns
    • Reading a few of the columns
    • Filtering using a filter predicate
    • Writing the data
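
To make the four use cases concrete, here is a minimal sketch in plain Python that contrasts a row-oriented layout (one JSON document per line) with a toy column-oriented layout. It is an illustration only: the sample records, the function names, the chunk size, and the per-chunk min/max statistics are all assumptions made for this example, and none of it is the Avro, JSON, ORC, or Parquet API or the benchmark code from the talk.

```python
import json

# Hypothetical sample records, loosely shaped like event logs.
ROWS = [
    {"id": 1, "type": "PushEvent", "actor": "alice"},
    {"id": 2, "type": "ForkEvent", "actor": "bob"},
    {"id": 3, "type": "PushEvent", "actor": "carol"},
    {"id": 4, "type": "WatchEvent", "actor": "dave"},
]


def write_row_oriented(rows):
    """Writing, row layout: one JSON document per line."""
    return "\n".join(json.dumps(r) for r in rows)


def write_column_oriented(rows, chunk_size=2):
    """Writing, toy column layout: split rows into chunks; each chunk keeps
    one list per column plus min/max of 'id' as chunk-level statistics."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        part = rows[start:start + chunk_size]
        columns = {name: [r[name] for r in part] for name in part[0]}
        stats = {"min_id": min(columns["id"]), "max_id": max(columns["id"])}
        chunks.append({"columns": columns, "stats": stats})
    return chunks


def read_all_row(text):
    """Reading all of the columns from the row layout."""
    return [json.loads(line) for line in text.splitlines()]


def read_columns_row(text, names):
    """Reading a few columns from the row layout: every full record is
    parsed first, then the unwanted fields are dropped."""
    return [{n: rec[n] for n in names} for rec in read_all_row(text)]


def read_columns_col(chunks, names):
    """Reading a few columns from the column layout: only the requested
    column lists are touched."""
    out = {n: [] for n in names}
    for chunk in chunks:
        for n in names:
            out[n].extend(chunk["columns"][n])
    return out


def filter_id_at_least(chunks, threshold):
    """Filtering with the predicate id >= threshold. Chunks whose max_id is
    below the threshold are skipped without scanning their values."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk["stats"]["max_id"] < threshold:
            skipped += 1
            continue
        cols = chunk["columns"]
        for i, value in enumerate(cols["id"]):
            if value >= threshold:
                matches.append({n: cols[n][i] for n in cols})
    return matches, skipped


if __name__ == "__main__":
    text = write_row_oriented(ROWS)
    chunks = write_column_oriented(ROWS)
    print(read_columns_row(text, ["type"]))
    print(read_columns_col(chunks, ["type"]))
    print(filter_id_at_least(chunks, 3))
```

In this toy model the row layout has to parse every record even when only one field is wanted, while the column layout reads just the requested lists and can skip whole chunks using stored statistics. Real formats add encoding, compression, and schema handling on top, which is why measured results on real data matter more than a sketch like this.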

    All of the benchmark code will be open source so that the experiments can be replicated. Furthermore, it is important to benchmark on real data rather than synthetic data, so the benchmark uses the GitHub logs data freely available from the GitHub Archive.

    Owen O'Malley

    Cloudera

    Owen O’Malley is a co-founder and Technical Fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. In the last 10 years, he has been the architect of MapReduce, Security, and now Hive. Recently he has been driving the development of the ORC file format and adding ACID transactions to Hive.

    Comments

    Owen O'Malley
    09/28/2016 11:53am EDT

    The slides are available at http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

    Scott Hauck
    09/28/2016 7:31am EDT

    Are your slides available?