Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

File format benchmark: Avro, JSON, ORC, and Parquet

Owen O'Malley (HortonWorks)
11:20am–12:00pm Wednesday, 09/28/2016
Data innovations
Location: 3D 10
Level: Intermediate
Average rating: 4.92 (12 ratings)

Prerequisite knowledge

  • An understanding of your organization's data use cases and how you expect to access the data
What you'll learn

  • Learn how to evaluate the various factors that influence the choice of the best format for your data without having to do a bake-off yourself
Description

    The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Picking the best data format depends on what kind of data you have and how you plan to use it. Depending on your use case, different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.

    Use cases include:

    • Reading all of the columns
    • Reading a few of the columns
    • Filtering using a filter predicate
    • Writing the data
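
    As a rough illustration of why these use cases behave differently across formats, the sketch below contrasts a row-oriented layout (one JSON object per line) with a toy column-oriented layout (one JSON array per column). It is not taken from the talk's benchmark code and does not use Avro, ORC, or Parquet; the record shape and all function names (make_records, write_rows, write_columns, project_from_rows, project_from_columns, filter_rows, timed) are assumptions made for this example, built only on the Python standard library.

```python
import json
import time


def make_records(n):
    """Build hypothetical event records (a stand-in for real log data)."""
    return [
        {
            "id": i,
            "actor": f"user{i % 10}",
            "type": "PushEvent" if i % 2 == 0 else "WatchEvent",
            "payload": "x" * 50,
        }
        for i in range(n)
    ]


def write_rows(records):
    """Row-oriented write: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


def write_columns(records):
    """Toy column-oriented write: one JSON array per column name."""
    names = list(records[0])
    return {name: json.dumps([r[name] for r in records]) for name in names}


def read_all_rows(row_blob):
    """Read all of the columns from the row-oriented layout."""
    return [json.loads(line) for line in row_blob.splitlines()]


def project_from_rows(row_blob, wanted):
    """Read a few columns: every full record must still be parsed."""
    return [{k: rec[k] for k in wanted} for rec in read_all_rows(row_blob)]


def project_from_columns(col_blobs, wanted):
    """Read a few columns: only the requested columns are decoded."""
    cols = {name: json.loads(col_blobs[name]) for name in wanted}
    length = len(cols[wanted[0]])
    return [{name: cols[name][i] for name in wanted} for i in range(length)]


def filter_rows(rows, predicate):
    """Apply a filter predicate to already-decoded rows."""
    return [r for r in rows if predicate(r)]


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

    One way to use it is to time project_from_rows(row_blob, ["id", "type"]) against project_from_columns(col_blobs, ["id", "type"]) with timed(). Both return the same rows, but the columnar version never decodes the payload column. Actual timings depend on the data and the machine, which is why measuring on real data matters.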

    All of the benchmark code will be open source so that the experiments can be replicated. Because it is important to benchmark on real data rather than synthetic data, the benchmarks use the GitHub logs data freely available from the GitHub Archive.

    Owen O'Malley

    HortonWorks

    Owen O’Malley is a software architect on Hadoop working for HortonWorks, a startup focusing on Hadoop development. Prior to cofounding HortonWorks, Owen and the rest of the HortonWorks team worked at Yahoo developing Hadoop. He has been contributing patches to Hadoop since before it was separated from Nutch and was the original chair of the Hadoop PMC. Before working on Hadoop, he worked on Yahoo Search’s WebMap project, which builds a graph of the known Web and applies many heuristics to the entire graph that control search. Prior to Yahoo, Owen wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He holds a PhD in software engineering from the University of California, Irvine.

    Comments

    Owen O'Malley
    09/28/2016 11:53am EDT

    The slides are available at http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

    Scott Hauck
    09/28/2016 7:31am EDT

    Are your slides available?