As a unified data processing engine, Spark is expected to achieve high throughput and ultralow latency for different workloads like ad hoc queries, real-time streaming, and machine learning. However, under certain workloads (large join/aggregation), its performance is limited by the overhead from the persistence on local shuffle drives and transferring with TCP/IP networking. Previous studies showed this can be improved using RDMA networking and fast storage like NVMe SSDs, which should have orders of magnitude improvements, but the performance gain didn’t go that much due to the long I/O stack in the shuffle stage. Thanks to the new DCPMM technology, which offers persistency with memory-like speed, we’re able to shorten the I/O stack and make Spark a 100% in-memory computing platform.
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance. An open source, community-driven project contributed by storage and big data engineers, Spark-PMoF leverages PMoF (persistent memory over fabric) technology and enables the codesign of a storage/network stack to speed up Spark shuffle performance. By using P2P-connected persistent shuffle storage with memory-like speed, you can fully bypass the context switch and greatly improve big data analytics performance without hurting any Spark consistency. Initial benchmark results using microworkloads show Spark-PMOF achieves great improvements.
Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He’s been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.
Haodong Tang is a big data storage optimization and development engineer at Intel.
Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop distributed file system (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com