Sep 23–26, 2019

Improving Spark by taking advantage of disaggregated architecture

Chenzhao Guo (Intel), Carson Wang (Intel)
5:25pm–6:05pm Wednesday, September 25, 2019
Location: 1A 21/22
Average rating: 5.00 (2 ratings)

Who is this presentation for?

  • Data engineers

Level

Intermediate

Description

Shuffle in Apache Spark is the procedure that redistributes data across partitions. It is often costly because the shuffle data must be persisted on local disks, and it is the source of many scalability and reliability issues in Spark. Moreover, the assumption of collocated storage and compute does not always hold in today’s data centers, where the hardware trend is moving toward a disaggregated storage and compute architecture to improve cost efficiency and scalability.
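
For readers new to shuffle, here is a minimal Scala sketch of a shuffle-inducing Spark job: reduceByKey repartitions records by key, so each map task’s output is persisted to the executor’s local disks (under spark.local.dir) until the reduce tasks fetch it over the network.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-example")
      .master("local[*]") // local mode, just for demonstration
      .getOrCreate()

    val words = spark.sparkContext.parallelize(Seq("a", "b", "a", "c", "b"))

    // reduceByKey hash-partitions the (word, 1) pairs across reducers;
    // the map-side output is written as shuffle files on local disk and
    // served to the reduce tasks over the network.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```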

Chenzhao Guo and Carson Wang outline how a new Spark shuffle manager addresses these challenges and supports a disaggregated storage and compute architecture. The new design writes shuffle data to a remote cluster with pluggable storage backends, so the failure of a compute node no longer forces recomputation of the shuffle data. Spark executors can also be allocated and recycled dynamically, resulting in better resource use.
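
To make this concrete, here is a rough sketch of what implementing a pluggable shuffle manager entails. The ShuffleManager interface and its methods are those of Spark 2.4; the RemoteShuffleManager class and its package are illustrative placeholders, not the speakers’ actual implementation.

```scala
// Placed under org.apache.spark.shuffle so the private[spark] interface is visible.
package org.apache.spark.shuffle.remote

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.shuffle._

// Hypothetical skeleton: Spark instantiates the class named by
// spark.shuffle.manager through this one-argument constructor.
class RemoteShuffleManager(conf: SparkConf) extends ShuffleManager {

  // Driver side: called when a shuffle dependency is created.
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = ???

  // Map tasks get a writer that would persist output to remote storage
  // instead of local disk.
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = ???

  // Reduce tasks get a reader that fetches partitions from remote storage,
  // so a lost compute node does not force recomputation.
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = ???

  override def unregisterShuffle(shuffleId: Int): Boolean = true

  override def shuffleBlockResolver: ShuffleBlockResolver = ???

  override def stop(): Unit = ()
}
```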

For most people running Spark with collocated storage, upgrading the disks on every node to the latest hardware, such as NVMe SSDs and persistent memory, is usually impractical due to cost and system compatibility. The new shuffle manager instead enables building a separate cluster that stores and serves the shuffle data, leveraging the latest hardware to improve performance and reliability. This work also matters in the high-performance computing (HPC) world, where storage and compute are typically disaggregated and more people are starting to use Spark. You’ll leave with an overview of the challenges in the current Spark shuffle implementation and the design of the new shuffle manager. Chenzhao and Carson also present a performance study of the work.
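
Deployment might then look like the sketch below. spark.shuffle.manager and the dynamic-allocation key are standard Spark settings; the RemoteShuffleManager class and the storage-URI key are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("disaggregated-shuffle-demo")
  // Swap in a custom shuffle manager (illustrative class name).
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.remote.RemoteShuffleManager")
  // Hypothetical key: where the separate shuffle cluster stores data.
  .config("spark.shuffle.remote.storage.uri",
    "hdfs://shuffle-cluster:9000/shuffle")
  // With shuffle data off the compute nodes, executors can be released
  // and re-acquired freely, improving resource use.
  .config("spark.dynamicAllocation.enabled", "true")
  .getOrCreate()
```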

Prerequisite knowledge

  • A basic understanding of Spark shuffle

What you'll learn

  • Understand the essence of Spark shuffle and disaggregated architecture

Chenzhao Guo

Intel

Chenzhao Guo is a big data software engineer at Intel. He’s currently a contributor to Apache Spark and a committer on OAP and HiBench. He graduated from Zhejiang University.


Carson Wang

Intel

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He’s an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

