Apache Hadoop installations traditionally collocate compute and storage using. However, there is a new trend toward disaggregation, where the compute and storage clusters are separate and can be scaled independently. The popularity of object stores has also increased recently. Object stores offer a highly scalable and low cost alternative for storing the huge amounts of data being generated and collected by individuals, businesses, and organizations. Combining both trends, many have started to explore the impact of using object storage as the primary data storage for their Hadoop and Spark compute clusters.
Hadoop can easily operate without HDFS and contains modules that provide a convenient way to access object stores like Amazon S3, OpenStack Swift, Azure Blob Storage, and IBM Cloud Object Storage. Many big data projects, like Apache Spark or Alluxio, rely on these same Hadoop modules to ease their integration with object stores. However, these modules are designed to work with filesystems rather than object stores. Thus, they contain flows and operations (e.g., the creation of temporary objects and rename) that are not native to object stores and introduce unnecessary complexity and overhead.
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used. In tests, running DFSIO on Hadoop with Stocator achieved a 37% speed up for write flows and a 96% speed up for the read flows compared to Hadoop with its native object store connectors, while running Terasort on Spark with Stocator achieved a 500% speedup compared to the native connectors. Wordcount on Spark, where the output is much smaller than the input, achieved a 100% speedup.
Trent Gray-Donald is a distinguished engineer in IBM Analytics’s Analytic Platform Services organization, where he works on analytics services for IBM Bluemix, including IBM Big Insights and Apache Spark. Previously, Trent worked on high-speed in-memory analytics solutions, such as Cognos BI and DB2 BLU. He was a member of the IBM Java Technology Centre and was overall technical lead on the IBM Java 7 project. Trent holds a bachelor of mathematics in computer science from the University of Waterloo, Canada.
Gil Vernik is a researcher in IBM’s Storage Clouds, Security, and Analytics group, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.
Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?
Join the conversation here (requires login)
©2017, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org