Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability.
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDinsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how to easily repeat the benchmarks through ALOJA and benefit from BigBench to optimize your Spark cluster for advanced users. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. (A preprint copy can be obtained here.)
Nicolas Poggi is an IT professional and researcher with focus on performance and scalability of data-intensive applications. Nicolas leads a new research project on upcoming architectures for the web at the Barcelona Super Computing and Microsoft Research Joint Center in Barcelona. Nicolas combines a pragmatic approach to performance and scalability from his web industry experience with research in server resource management (such as leveraging machine-learning techniques to optimize performance and profits on the web). Nicolas is a frequent speaker at and organizer for the Barcelona web performance community. He founded and has spoken at the Barcelona Web Performance group and is organizing the upcoming WebPerfDays event in Barcelona. Nicolas also lectures for master’s classes at UPC. He holds a PhD from BarcelonaTech (UPC).
©2017, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org