Cloud-based big data analytics is growing faster than traditional on-premises solutions, as it provides excellent scalability, simplifies management, and reduces costs. Public cloud adoption has become the top priority for big data investments. However, performance and feature gaps still exist that must be resolved.
Jian Zhang, Chendi Xue, and Yuan Zhou explore the performance and feature challenges caused by migrating big data analytics workloads to the cloud, including disaggregated object storage commonly used by public cloud service providers (CSPs), cloud connectors for big data and the cloud, and compute service orchestration (e.g., running Spark on Kubernetes). They then share the evolution of big data analytics in the public cloud and reveal the root cause of performance gaps for typical workloads (TeraSort, DFSIO, TPC-DS, and k-means) in different scenarios. They conclude with a discussion of a new in-memory data accelerator: a high-performance layer that leverages state-of-the-art technologies like persistent memory and RDMA to accelerate ephemeral data access. You’ll see promising performance numbers on prototypes that illustrate how this approach enables hybrid transactional analytical processing (HTAP) workloads in the cloud. Along the way, you’ll learn how to leverage new hardware technologies like persistent memory and RDMA for big data analytics in the cloud.
Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop distributed file system (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.
Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years’ experience in big data and cloud system optimization, focusing on storage, network software stack performance analysis, and optimization. Her development work includes Spark-Shuffle optimization, Spark-SQL ColumnarBased execution, a compute-side cache implementation, and a storage benchmark tool implementation. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master’s degree studies.
Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He’s been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.
©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com