Improving Spark by taking advantage of disaggregated architecture
Who is this presentation for?
Data Engineers
Shuffle in Apache Spark is a procedure that redistributes data across partitions. It is often costly and requires the shuffle data to be persisted on local disks. This procedure is the source of many scalability and reliability issues in Spark. Moreover, the assumption of collocated storage does not always hold in today's data centers: the hardware trend is moving toward a disaggregated storage and compute architecture to improve cost efficiency and scalability.
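To make the procedure concrete, here is a minimal Python sketch of what the map side of a shuffle does conceptually: records are bucketed by key hash into target reduce partitions. The function name and structure are illustrative only, not Spark's actual implementation, which also sorts, spills, and persists these buckets to disk.

```python
def shuffle_partition(records, num_partitions):
    """Assign each (key, value) record to a reduce partition by key hash.

    Conceptual sketch of map-side shuffle bucketing; real Spark shuffle
    writers additionally sort, aggregate, and spill data to local disk.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # All records with the same key land in the same reduce partition,
        # which is what lets the reduce side process each key in one place.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Example: three records, two distinct keys, four reduce partitions.
buckets = shuffle_partition([("a", 1), ("b", 2), ("a", 3)], 4)
```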
To address the challenges in Spark shuffle and support a disaggregated storage and compute architecture, we implemented a new Spark shuffle manager. The new architecture supports writing shuffle data to a remote cluster with different storage backends. The failure of a compute node no longer causes recomputation of the shuffle data, and Spark executors can be allocated and recycled dynamically, resulting in better resource utilization. For most customers running Spark with collocated storage, upgrading the disks on every node to the latest hardware, such as NVMe SSDs and persistent memory, is usually challenging because of cost considerations and system compatibility. The new shuffle manager enables them to build a separate cluster for storing and serving shuffle data, leveraging the latest hardware to improve performance and reliability. In the HPC world, more customers are starting to use Spark, and this work is also important for them because storage and compute in an HPC cluster are typically disaggregated. In this talk, we will provide an overview of the challenges in the current Spark shuffle implementation and the design of the new shuffle manager. A performance study of the work will also be presented.
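Spark's shuffle implementation is pluggable through the `spark.shuffle.manager` setting, which is how an alternative shuffle manager like the one described here would typically be wired in. The fragment below is a hypothetical sketch of such a deployment configuration; the class name and remote-storage property are placeholders, not the actual implementation's names.

```
# spark-defaults.conf (illustrative sketch; class and property names are placeholders)
spark.shuffle.manager        org.apache.spark.shuffle.remote.RemoteShuffleManager
spark.shuffle.remote.storage hdfs://shuffle-cluster:9000/spark-shuffle
```

The point of the design is that only this configuration changes: jobs are untouched, while shuffle blocks are written to and served from the dedicated storage cluster instead of each executor's local disks.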
Prerequisite knowledge
Spark shuffle
What you'll learn
Intel Asia-Pacific Research & Development Ltd.
Chenzhao Guo is a big data engineer at Intel. He graduated from Zhejiang University and joined Intel in 2016. He is currently a contributor to Apache Spark and a committer on OAP and HiBench.
Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.