Improving Spark by taking advantage of disaggregated architecture
Who is this presentation for?
- Data engineers
Shuffle in Apache Spark is the procedure that redistributes data across partitions. It is often costly and requires the shuffle data to be persisted on local disks, which leads to a number of scalability and reliability issues. Moreover, the assumption of collocated compute and storage no longer always holds in today’s data centers: the hardware trend is moving toward a disaggregated storage and compute architecture in order to improve cost efficiency and scalability.
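Conceptually, the redistribution step of a shuffle can be illustrated with a minimal hash-partitioning sketch (plain Python for illustration only; Spark's actual implementation sorts and writes partitioned files to local disk rather than building in-memory maps):

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    # Like Spark's HashPartitioner, assign each key to a partition by hash.
    return hash(key) % num_partitions

def shuffle(map_outputs, num_partitions):
    """Redistribute (key, value) records from map tasks so that all
    values for a given key land in the same reduce partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for records in map_outputs:          # one record list per map task
        for key, value in records:
            partitions[partition_of(key, num_partitions)][key].append(value)
    return partitions

# Outputs of two map tasks; after the shuffle, every value for key "a"
# ends up in a single partition, ready for a reduce task to aggregate.
maps = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
parts = shuffle(maps, 2)
```

The cost in a real cluster comes from materializing these partitions to disk and transferring them over the network to the reduce tasks, which is what motivates rethinking where the shuffle data lives.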
Chenzhao Guo and Carson Wang outline how to address the challenges in Spark shuffle and support a disaggregated storage and compute architecture by implementing a new Spark shuffle manager. The new architecture supports writing shuffle data to a remote cluster with different storage backends. The failure of a compute node no longer forces recomputation of the shuffle data, and Spark executors can be allocated and recycled dynamically, resulting in better resource utilization.
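Spark's shuffle machinery is pluggable via the `spark.shuffle.manager` configuration key, which is how an alternative implementation like this can be swapped in. The class name and remote-storage settings below are hypothetical illustrations, not the actual configuration of the work presented:

```
# spark-defaults.conf — illustrative sketch only.
# spark.shuffle.manager is a real Spark setting (default: sort);
# the class and the remote-storage key below are assumed names.
spark.shuffle.manager                  org.apache.spark.shuffle.remote.RemoteShuffleManager
spark.shuffle.remote.storageMasterUri  hdfs://storage-cluster:9000
```

Because the shuffle manager is selected per application, a cluster can adopt remote shuffle storage without any change to job code.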
For most people running Spark with collocated storage, it’s usually challenging to upgrade the disks on every node to the latest hardware, such as NVMe SSDs or persistent memory, because of cost and system compatibility considerations. The new shuffle manager makes it possible to build a separate cluster for storing and serving the shuffle data, leveraging the latest hardware to improve performance and reliability. In the high-performance computing (HPC) world, where storage and compute are typically disaggregated, more people are starting to use Spark, so this work is important for them as well. You’ll leave with an overview of the challenges in the current Spark shuffle implementation and the design of the new shuffle manager. Chenzhao and Carson also present a performance study of the work.
Prerequisite knowledge
- A basic understanding of Spark shuffle
What you'll learn
- Understand the essence of Spark shuffle and disaggregated architecture
Chenzhao Guo is a big data software engineer at Intel. He’s currently a contributor to Apache Spark and a committer on OAP and HiBench. He graduated from Zhejiang University.
Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He’s an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.