Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Performance tuning your Hadoop/Spark clusters to use cloud storage

Stephen Wu (Microsoft)
11:20am–12:00pm Thursday, September 28, 2017
Secondary topics:  Cloud

Who is this presentation for?

  • Data scientists and solution architects

Prerequisite knowledge

  • A basic understanding of Hadoop, Spark, and YARN concepts
  • Familiarity with compute and storage for big data analytics

What you'll learn

  • How to correctly tune the performance of your workloads when your data is stored in remote storage in the cloud


Remote storage provides the ability to separate compute and storage, which ushers in a new world of infinitely scalable and cost-effective storage. Remote storage in the cloud built to the HDFS standard has unique features that make it a great choice for storing and analyzing petabytes of data at a time. But with such scale, I/O performance becomes an increasingly important consideration when performing analysis on this data. Stephen Wu demonstrates how to correctly performance tune your workloads when your data is stored in remote storage in the cloud.

When running workloads atop remote storage, maximizing the throughput used between the compute layer and the storage layer is of primary importance. Often, the compute layer isn’t large enough to perform enough parallel reads and writes to saturate the available throughput in the storage layer, leading to poor performance because the available resources are underutilized. However, many aspects of the compute layer, which includes the physical layer, YARN layer, and workload layer, can be tuned to increase the concurrency of reads and writes to the store. In the physical layer, the size and number of nodes in the user’s cluster should be chosen wisely to get the best performance for each workload. Within the YARN layer, memory allocation and the number of containers are a few of the variables that can be tuned for all workloads. Additionally, setting the number of tasks appropriately to utilize all the available resources will further improve job run times. In the workload layer, Spark and Hive have specific characteristics that can be modified for performance tuning. Taking advantage of the specific nuances of each workload will help the user extract every last bit of performance from remote storage.
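The link between YARN memory allocation and read/write concurrency can be illustrated with a little arithmetic. The following is a minimal sketch, not material from the session: the function name and the cluster figures are hypothetical, and it assumes the scheduler accounts for both memory and vcores (some YARN configurations consider memory only), so the result is best read as an upper bound.

```python
def concurrent_containers(num_nodes, node_memory_mb, node_vcores,
                          container_memory_mb, container_vcores=1):
    """Upper bound on YARN containers that can run at once across the cluster.

    Each node can host only as many containers as fit in BOTH the memory
    and the vcores it offers to YARN, so the tighter resource wins.
    """
    by_memory = node_memory_mb // container_memory_mb
    by_cores = node_vcores // container_vcores
    return num_nodes * min(by_memory, by_cores)


# Hypothetical cluster: 8 worker nodes, each offering 96 GB and 32 vcores to YARN.
# Smaller containers mean more of them, and so more parallel readers and writers.
for container_mb in (8192, 4096, 3072):
    total = concurrent_containers(8, 96 * 1024, 32, container_mb)
    print(f"{container_mb} MB containers -> {total} concurrent containers")
```

In this hypothetical setup, shrinking the container size stops paying off once the vcore limit becomes the tighter constraint, and containers that are too small for the task risk out-of-memory failures, so the container size is a trade-off rather than a value to minimize.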

In Spark, you must first determine the number of applications running on the cluster to understand the amount of resources available to run your workload. Then you must set executor memory and executor cores to optimize for an I/O-intensive workload. Lastly, you have to determine the number of executors based on the available resources and the memory and core settings that you have previously chosen. In Hive, memory plays the main role in determining how many YARN containers can run concurrently. By tuning the memory, more YARN containers can be created to run more tasks in parallel. Controlling the split waves and mapper size will help to ensure that all available containers are used.
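The three Spark steps described above can be sketched as a small calculation. This is an illustrative sketch under stated assumptions, not the speaker's method: the function and its inputs are hypothetical, the cluster is assumed to be split evenly between applications, and the per-executor memory overhead is assumed to follow Spark's documented default of 10% of executor memory with a 384 MB minimum. It also does not reserve room for the driver or the YARN application master.

```python
def spark_executor_plan(yarn_memory_mb, yarn_vcores, num_apps,
                        executor_memory_mb, executor_cores,
                        overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate how many executors one Spark application can run on YARN."""
    # Step 1: resources available to this application, assuming an even
    # split across all applications running on the cluster.
    app_memory_mb = yarn_memory_mb // num_apps
    app_vcores = yarn_vcores // num_apps

    # Step 2: each executor occupies a YARN container sized as heap + overhead.
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    container_mb = executor_memory_mb + overhead_mb

    # Step 3: the executor count is bounded by the scarcer resource.
    by_memory = app_memory_mb // container_mb
    by_cores = app_vcores // executor_cores
    return {
        "container_mb": container_mb,
        "num_executors": min(by_memory, by_cores),
        "limited_by": "memory" if by_memory <= by_cores else "cores",
    }


# Hypothetical cluster: 768 GB and 256 vcores available to YARN, shared by 2 apps.
plan = spark_executor_plan(
    yarn_memory_mb=768 * 1024, yarn_vcores=256, num_apps=2,
    executor_memory_mb=6144, executor_cores=4,
)
print(plan)
```

The resulting values would map onto the `spark.executor.memory`, `spark.executor.cores`, and `spark.executor.instances` settings. The `limited_by` field hints at which setting to revisit: if cores are the limit, memory is sitting idle, and vice versa.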

Stephen Wu


Stephen Wu is a senior program manager for big data at Microsoft.