Picking your distribution and platform is only the first of many decisions you need to make to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance.
Silvia Oliveros and Stephen O’Sullivan cover the four major data formats (plain text, SequenceFile, Avro, and Parquet) and provide insight into what they are and how best to use and store them in HDFS. Each format has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, Silvia and Stephen have observed performance differences on the order of 25x between Parquet and plain text files for certain workloads. However, no single format is always better than the others.
Drawing from a few real-world use cases, Silvia and Stephen cover the hows, whys, and whens of choosing one format over another and take a closer look at some of the tradeoffs each offers.
Silvia Oliveros is a data engineer at Silicon Valley Data Science, where she helps clients explore and analyze their data. Silvia has a background in computer engineering and visual analytics and is interested in building and optimizing the infrastructure and data pipelines used to gather insights from various datasets.
A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and application solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.