The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Picking the best data format depends on what kind of data you have and how you plan to use it. Depending on your use case, different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.
Use cases include:
All of the benchmark code will be open source so that the experiments can be replicated. Furthermore, it is important to benchmark on real data rather than synthetic data. The benchmarks use the GitHub logs data, freely available from the GitHub Archive.
Owen O’Malley is a software architect on Hadoop working for Hortonworks, a startup focusing on Hadoop development. Prior to cofounding Hortonworks, Owen and the rest of the Hortonworks team worked at Yahoo developing Hadoop. He has been contributing patches to Hadoop since before it was separated from Nutch and was the original chair of the Hadoop PMC. Before working on Hadoop, he worked on Yahoo Search’s WebMap project, which builds a graph of the known Web and applies many heuristics to the entire graph that control search. Prior to Yahoo, Owen wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He holds a PhD in software engineering from the University of California, Irvine.