Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

The state of Spark in the cloud

Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
14:05–14:45 Thursday, 25 May 2017
Big data and the Cloud, Spark & beyond
Location: Capital Suite 12
Level: Intermediate

Who is this presentation for?

  • Data engineers, managers, those in operations, and data scientists

Prerequisite knowledge

  • A basic understanding of Spark or Hive
  • Familiarity with cloud usage and offerings

What you'll learn

  • Compare current cloud offerings on Spark, including versions, architectures, and price-performance, and learn how they differ from Hive and Hadoop
  • Understand where Spark shines over Hive, Hadoop, and Mahout
  • Get an introduction to BigBench and benchmarking in general


Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the release of v2, making it challenging to keep production services up to date, compatible, and stable, both on-premises and in the cloud.

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (Spark SQL, DataFrames, MLlib, etc.).

The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how to easily repeat the benchmarks through ALOJA and, for advanced users, how to use BigBench to optimize a Spark cluster. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. (A preprint copy can be obtained here.)

Nicolas Poggi

Barcelona Supercomputing-Microsoft Research Center

Nicolas Poggi is an IT professional and researcher with a focus on performance and scalability of data-intensive applications. Nicolas leads a new research project on upcoming architectures for the web at the Barcelona Supercomputing and Microsoft Research Joint Center in Barcelona. Nicolas combines a pragmatic approach to performance and scalability from his web industry experience with research in server resource management (such as leveraging machine-learning techniques to optimize performance and profits on the web). Nicolas is a frequent speaker at and organizer for the Barcelona web performance community. He founded and has spoken at the Barcelona Web Performance group and is organizing the upcoming WebPerfDays event in Barcelona. Nicolas also lectures for master’s classes at UPC. He holds a PhD from BarcelonaTech (UPC).