Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
3:50pm–4:30pm Thursday, March 28, 2019
Average rating: 3.33 out of 5 (3 ratings)

Who is this presentation for?

  • Software engineers

Level

Intermediate

Prerequisite knowledge

  • Familiarity with Spark shuffle, persistent memory, and RDMA

What you'll learn

  • Understand Persistent Memory over Fabric
  • Learn how to optimize Spark shuffle

Description

As a unified data processing engine, Spark is expected to deliver high throughput and ultralow latency across workloads such as ad hoc queries, real-time streaming, and machine learning. Under certain workloads (large joins and aggregations), however, its performance is limited by the overhead of persisting shuffle data to local drives and transferring it over TCP/IP networking. Previous studies showed that RDMA networking and fast storage such as NVMe SSDs can help and should in principle yield orders-of-magnitude improvements, but the measured gains were much smaller because of the long I/O stack in the shuffle stage. Intel Optane DC persistent memory modules (DCPMM) offer persistence at memory-like speed, which makes it possible to shorten the I/O stack and make Spark a 100% in-memory computing platform.
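
For readers less familiar with the shuffle stage, the following is a minimal Python sketch of a hash-partitioned shuffle. It is not Spark's implementation and not Spark-PMoF's. The function names, the toy partitioner, and the sample records are all hypothetical, chosen only to show where the two costs named above arise: the map side persists one bucket per reducer, and the reduce side fetches its bucket from every map output.

```python
from collections import defaultdict


def map_side_shuffle_write(records, num_reducers):
    """Partition (key, value) records into one bucket per reducer.

    In a real engine each bucket is persisted to shuffle storage;
    here a plain dict stands in for that storage (assumption).
    """
    buckets = defaultdict(list)
    for key, value in records:
        # Toy deterministic partitioner: sum of the key's UTF-8 bytes.
        # (Python's built-in hash() of str is randomized per process.)
        partition = sum(key.encode("utf-8")) % num_reducers
        buckets[partition].append((key, value))
    return dict(buckets)


def reduce_side_shuffle_read(all_map_outputs, reducer_id):
    """Fetch this reducer's bucket from every map output and sum by key.

    In a real engine this fetch crosses the network (assumption: the
    list lookup below stands in for that transfer).
    """
    totals = defaultdict(int)
    for map_output in all_map_outputs:
        for key, value in map_output.get(reducer_id, []):
            totals[key] += value
    return dict(totals)


# Two hypothetical map-side input partitions.
map_inputs = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
NUM_REDUCERS = 2

# Shuffle write: every map task produces one bucket per reducer.
map_outputs = [map_side_shuffle_write(part, NUM_REDUCERS) for part in map_inputs]

# Shuffle read: every reducer pulls its bucket from every map task.
results = {}
for reducer_id in range(NUM_REDUCERS):
    results.update(reduce_side_shuffle_read(map_outputs, reducer_id))

print(results)
```

Every map task writes a bucket for every reducer, and every reducer reads from every map task, so the number of storage writes and network fetches grows with the product of the two. That is why the storage and network paths dominate shuffle-heavy joins and aggregations.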

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance. An open source, community-driven project contributed by storage and big data engineers, Spark-PMoF leverages PMoF (persistent memory over fabric) technology and enables the codesign of the storage and network stacks to speed up Spark shuffle. With peer-to-peer-connected persistent shuffle storage running at memory-like speed, shuffle I/O can fully bypass context switches, greatly improving big data analytics performance without compromising Spark's consistency. Initial benchmark results using microworkloads show that Spark-PMoF delivers substantial improvements.

Yuan Zhou

Intel

Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He’s been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.

Haodong Tang

Intel

Haodong Tang is a big data storage optimization and development engineer at Intel.

Jian Zhang

Intel

Jian Zhang is a senior software engineering manager at Intel, where he and his team focus primarily on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience in performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph, in working with the Hadoop Distributed File System (HDFS), and in benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.