San FranciscoLondon New York

Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Please log in

Add to Your Schedule

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Yuan Zhou (Intel), haodong tang (Intel), Jian Zhang (Intel)

3:50pm–4:30pm Thursday, March 28, 2019

Data Engineering & Architecture
Location: 2008

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Average rating:

(3.33, 3 ratings)

Download slides (PPTX)

Who is this presentation for?

Software engineers

Level

Intermediate

Prerequisite knowledge

Familiarity with Spark shuffle, persistent memory, and RDMA

What you'll learn

Understand Persistent Memory over Fabric
Learn how to optimize Spark shuffle

Description

As a unified data processing engine, Spark is expected to achieve high throughput and ultralow latency for different workloads like ad hoc queries, real-time streaming, and machine learning. However, under certain workloads (large join/aggregation), its performance is limited by the overhead from the persistence on local shuffle drives and transferring with TCP/IP networking. Previous studies showed this can be improved using RDMA networking and fast storage like NVMe SSDs, which should have orders of magnitude improvements, but the performance gain didn’t go that much due to the long I/O stack in the shuffle stage. Thanks to the new DCPMM technology, which offers persistency with memory-like speed, we’re able to shorten the I/O stack and make Spark a 100% in-memory computing platform.

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance. An open source, community-driven project contributed by storage and big data engineers, Spark-PMoF leverages PMoF (persistent memory over fabric) technology and enables the codesign of a storage/network stack to speed up Spark shuffle performance. By using P2P-connected persistent shuffle storage with memory-like speed, you can fully bypass the context switch and greatly improve big data analytics performance without hurting any Spark consistency. Initial benchmark results using microworkloads show Spark-PMOF achieves great improvements.

Yuan Zhou

Intel

Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He’s been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.

haodong tang

Intel

Haodong Tang is a big data storage optimization and development engineer at Intel.

Jian Zhang

Intel

Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop distributed file system (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presented by

Strategic Sponsors

Zettabyte Sponsor

Contributing Sponsors

Exabyte Sponsors

Impact Sponsors

Supporting Sponsor

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com