Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Hadoop and object stores: Can we do it better?

14:05–14:45 Wednesday, 24 May 2017
Level: Beginner

Who is this presentation for?

  • Anyone interested in Hadoop and Spark integration with object stores

What you'll learn

  • Explore Stocator, an open source object store connector
  • Learn how to make Hadoop and Spark work better with object stores


Apache Hadoop installations have traditionally collocated compute and storage. However, there is a new trend toward disaggregation, where the compute and storage clusters are separate and can be scaled independently. The popularity of object stores has also increased recently: they offer a highly scalable, low-cost alternative for storing the huge amounts of data generated and collected by individuals, businesses, and organizations. Combining both trends, many have started to explore the impact of using object storage as the primary data store for their Hadoop and Spark compute clusters.

Hadoop can easily operate without HDFS and contains modules that provide a convenient way to access object stores like Amazon S3, OpenStack Swift, Azure Blob Storage, and IBM Cloud Object Storage. Many big data projects, like Apache Spark or Alluxio, rely on these same Hadoop modules to ease their integration with object stores. However, these modules were designed for filesystems rather than object stores, so they contain flows and operations (e.g., creating temporary objects and renaming them) that are not native to object stores and introduce unnecessary complexity and overhead.
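To see why the rename-based flow hurts, consider that most object stores have no atomic rename: a "rename" must be emulated as a server-side copy followed by a delete. The toy model below (an illustrative sketch only — the class names and operation counts are hypothetical, not Stocator or Hadoop internals) contrasts a filesystem-style temporary-then-rename commit with a direct write to the final object name:

```python
# Illustrative sketch: cost of a filesystem-style commit protocol on an
# object store. Object stores lack atomic rename, so "rename" is modeled
# here as a server-side COPY plus a DELETE. All names are hypothetical.

class ObjectStore:
    """Toy flat-namespace object store that counts REST-like operations."""
    def __init__(self):
        self.objects = {}
        self.ops = {"PUT": 0, "COPY": 0, "DELETE": 0}

    def put(self, key, data):
        self.objects[key] = data
        self.ops["PUT"] += 1

    def rename(self, src, dst):
        # No native rename: emulate with server-side COPY + DELETE.
        self.objects[dst] = self.objects.pop(src)
        self.ops["COPY"] += 1
        self.ops["DELETE"] += 1

def filesystem_style_commit(store, part, data):
    """Filesystem-oriented flow: write each task's output to a temporary
    location, then 'rename' it into the final output path on commit."""
    tmp = f"out/_temporary/{part}"
    store.put(tmp, data)
    store.rename(tmp, f"out/{part}")

def direct_write(store, part, data):
    """Object-store-native flow: write each object straight to its final name."""
    store.put(f"out/{part}", data)

fs_store, direct_store = ObjectStore(), ObjectStore()
for i in range(4):
    filesystem_style_commit(fs_store, f"part-{i}", b"rows")
    direct_write(direct_store, f"part-{i}", b"rows")

print(fs_store.ops)      # {'PUT': 4, 'COPY': 4, 'DELETE': 4}
print(direct_store.ops)  # {'PUT': 4, 'COPY': 0, 'DELETE': 0}
```

The filesystem-style path triples the number of object-store operations per output file, and each copy moves the full object payload again — which is the overhead an object-store-native connector avoids.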

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used. In tests, running DFSIO on Hadoop with Stocator achieved a 37% speedup for write flows and a 96% speedup for read flows compared to Hadoop with its native object store connectors, while running TeraSort on Spark with Stocator achieved a 500% speedup over the native connectors. WordCount on Spark, where the output is much smaller than the input, achieved a 100% speedup.
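For a sense of how such a connector plugs in, Stocator registers its own Hadoop FileSystem implementation and URI scheme in core-site.xml, so Hadoop and Spark route object-store I/O through it instead of the stock connectors. The fragment below follows the pattern in the Stocator documentation of the time (the swift2d scheme for OpenStack Swift); treat the exact property names as a version-dependent sketch rather than a definitive configuration:

```
<!-- core-site.xml: route swift2d:// URIs through Stocator
     (property names per the Stocator docs; may vary by version) -->
<property>
  <name>fs.swift2d.impl</name>
  <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
```

With this in place, jobs reference data with the connector's scheme, e.g. `sc.textFile("swift2d://container.service/data.txt")`, and no application code needs to change.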


Trent Gray-Donald


Trent Gray-Donald is a distinguished engineer in IBM Analytics’s Analytic Platform Services organization, where he works on analytics services for IBM Bluemix, including IBM BigInsights and Apache Spark. Previously, Trent worked on high-speed in-memory analytics solutions, such as Cognos BI and DB2 BLU. He was a member of the IBM Java Technology Centre and was overall technical lead on the IBM Java 7 project. Trent holds a bachelor of mathematics in computer science from the University of Waterloo, Canada.


Gil Vernik


Gil Vernik is a researcher in the Storage Clouds, Security, and Analytics Group at IBM, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.

Comments on this page are now closed.


Raja K Thaw | 10/01/2017 17:19 GMT

I think that’s the way it should go: Spark with object storage, with compression and dedupe behind the scenes. It is flexible enough to move to NAS or block storage if required.