Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Big data analytics in the public cloud: Challenges and opportunities

Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)
11:15–11:55 Thursday, 2 May 2019
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Big data software architects and those working at public cloud service providers

Level

Intermediate

Prerequisite knowledge

  • Familiarity with big data, the cloud, and benchmarking

What you'll learn

  • Understand performance gaps in the public cloud
  • Explore an in-memory data accelerator built with new hardware like persistent memory and RDMA NICs that improves the performance of big data analytics workloads in the cloud and enables new use cases

Description

Cloud-based big data analytics is growing faster than traditional on-premises solutions, as it provides excellent scalability, simplifies management, and reduces costs. Public cloud adoption has become the top priority for big data investments. However, performance and feature gaps still exist that must be resolved.

Jian Zhang, Chendi Xue, and Yuan Zhou explore the performance and feature challenges caused by migrating big data analytics workloads to the cloud, including disaggregated object storage commonly used by public CSPs, cloud connectors for big data and the cloud, and compute service orchestration (e.g., running Spark on Kubernetes). They then share the evolution of big data analytics in the public cloud and reveal the root causes of the performance gaps seen in typical workloads (TeraSort, DFSIO, TPC-DS, and k-means) in different scenarios. They conclude with a discussion of a new in-memory data accelerator: a high-performance layer that leverages state-of-the-art technologies like persistent memory and RDMA to accelerate ephemeral data access. You’ll see promising performance numbers on prototypes that illustrate how this approach enables hybrid transactional analytical processing (HTAP) workloads in the cloud. Along the way, you’ll learn how to leverage new hardware technologies like persistent memory and RDMA for big data analytics in the cloud.

Jian Zhang

Intel

Jian Zhang is a senior software engineering manager at Intel, where he and his team focus primarily on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience in performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph, in working with the Hadoop Distributed File System (HDFS), and in benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Chendi Xue

Intel

Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years’ experience in big data and cloud system optimization, focusing on performance analysis and optimization of the storage and network software stacks. Her development work includes Spark shuffle optimization, Spark SQL columnar-based execution, a compute-side cache implementation, and a storage benchmark tool. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master’s degree studies.

Yuan Zhou

Intel

Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team, focusing primarily on big data storage software. He has worked in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.