Bridging the gap between big data computing and high-performance computing

Supun Kamburugamuve (Indiana University)

4:35pm–5:15pm Thursday, September 26, 2019

Location: 1A 23/24

Data Engineering and Architecture

Secondary topics: Data, Analytics, and AI Architecture

Who is this presentation for?

Data engineers and data scientists

Level

Intermediate

Description

Big data computing and high-performance computing (HPC) have evolved over the years as separate paradigms. With the explosion of data and the growing demand for machine learning algorithms, the two paradigms increasingly borrow from each other for data management and algorithms. For example, public clouds such as Microsoft Azure are adding high-performance compute instances with InfiniBand, and large-scale deployments of GPUs in HPC clusters are enabling artificial intelligence algorithms on large datasets. In the future, you can expect more applications to explore the benefits of HPC while taking advantage of big data systems. Supun Kamburugamuve walks you through the differences between HPC systems and big data systems, outlining the areas where they can benefit from each other. He also presents performance differences, usability-based motivations, and architectural differences between systems and frameworks that can guide you toward picking the right solution.

Understanding the evolution of big data systems and HPC systems helps to define the key differences between them, the goals behind them, and their architectures. Four broad application classes drive the requirements of data analytics tools and frameworks: data pipelines, large-scale machine learning (including deep learning), streaming applications, and graph applications. Historically, HPC systems have focused less on data management and more on designing high-performance algorithms. Big data systems have done an excellent job in data management, data queries, and streaming applications. Research has shown that machine learning, deep learning, and graph algorithms can benefit immensely from HPC systems.

Parallel operators are one of the key building blocks of a distributed computing system. MapReduce is one of the best-known parallel operators, and there are many more, such as gather, partition, and scatter. These operators distribute data among parallel tasks and let the tasks reach agreement as a computation progresses. HPC systems have their own parallel operators, with semantics similar to those of big data systems. There are many possible optimizations for these operators, and they are a major factor in the performance of any distributed application. Advanced hardware such as InfiniBand also plays a big role in HPC applications, providing low-latency, high-throughput networking among a large number of nodes. Such networks are vital for scaling applications to thousands of nodes and tens of thousands of CPU cores.
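To make the operator semantics concrete, here is a minimal single-process sketch in Python in which each inner list stands in for the data held by one parallel task. The function names (scatter, gather, partition, allreduce) are illustrative and are not taken from any specific framework; allreduce is included as an assumption about what "reaching agreement" typically looks like in HPC codes. Real systems move this data over the network, which is where the optimizations mentioned above apply.

```python
import functools
from typing import Callable, List, TypeVar

T = TypeVar("T")


def scatter(data: List[T], num_workers: int) -> List[List[T]]:
    """Split data into contiguous chunks, one chunk per worker."""
    base, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)  # first `extra` workers get one more
        chunks.append(data[start:start + size])
        start += size
    return chunks


def gather(chunks: List[List[T]]) -> List[T]:
    """Collect every worker's chunk into one list, in worker order."""
    return [item for chunk in chunks for item in chunk]


def partition(chunks: List[List[T]], num_workers: int,
              key: Callable[[T], int]) -> List[List[T]]:
    """Shuffle: route each item to the worker chosen by key(item) % num_workers."""
    out: List[List[T]] = [[] for _ in range(num_workers)]
    for chunk in chunks:
        for item in chunk:
            out[key(item) % num_workers].append(item)
    return out


def allreduce(local_values: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Combine one value per worker and hand the same result back to every worker."""
    total = functools.reduce(op, local_values)
    return [total] * len(local_values)


if __name__ == "__main__":
    chunks = scatter(list(range(10)), 3)
    print(chunks)
    print(partition(chunks, 3, key=lambda x: x))
    print(allreduce([sum(c) for c in chunks], lambda a, b: a + b))
```

The partition operator corresponds to the shuffle step of MapReduce-style systems, while allreduce is the kind of collective that lets every task continue with an identical view of a shared value.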

The way iterations are programmed and executed is the other major difference between HPC and big data systems. Iterations are a key component of complex applications and one of the reasons for Spark's success over Hadoop. Different programming models and systems handle iterations in different ways. For example, Spark handles iterations in a central place (the driver), Flink embeds iterations into the data flow graph, and HPC systems distribute iterations to each worker. Each of these choices has different implications for the programming model and for performance. Supun compares these approaches and explores their performance, usability, and fault tolerance.
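The contrast between driver-side and worker-side iteration can be sketched as follows. This is a toy illustration, not how Spark or any HPC runtime is implemented: threads stand in for workers, the AllReduce class is a hypothetical helper built on threading.Barrier, and the workload (gradient descent toward the mean of the data) was chosen only because it is short. The embedded-in-the-graph style attributed to Flink is not shown.

```python
import threading
from typing import List


def driver_style(partitions: List[List[float]], steps: int, lr: float) -> float:
    """The loop lives in one central place; each pass launches one task per partition."""
    n = sum(len(p) for p in partitions)
    m = 0.0
    for _ in range(steps):
        # "Tasks": every partition computes a partial gradient for the current model.
        partials = [sum(m - x for x in p) for p in partitions]
        # The driver combines the partials and updates the model for the next pass.
        m -= lr * sum(partials) / n
    return m


class AllReduce:
    """Toy sum-allreduce for threads standing in for workers (hypothetical helper)."""

    def __init__(self, num_workers: int) -> None:
        self.slots = [0.0] * num_workers
        self.barrier = threading.Barrier(num_workers)

    def __call__(self, rank: int, value: float) -> float:
        self.slots[rank] = value
        self.barrier.wait()      # wait until every worker has written its value
        total = sum(self.slots)
        self.barrier.wait()      # wait until every worker has read before reuse
        return total


def worker_style(partitions: List[List[float]], steps: int, lr: float) -> List[float]:
    """Every worker runs the whole loop itself and synchronizes through allreduce."""
    # Simplification: the global count is computed up front instead of allreduced.
    n = sum(len(p) for p in partitions)
    allreduce = AllReduce(len(partitions))
    results = [0.0] * len(partitions)

    def worker(rank: int) -> None:
        m = 0.0
        for _ in range(steps):
            local = sum(m - x for x in partitions[rank])
            m -= lr * allreduce(rank, local) / n
        results[rank] = m        # each worker ends up with its own copy of the model

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(partitions))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(driver_style(data, steps=50, lr=0.5))
    print(worker_style(data, steps=50, lr=0.5))
```

In the driver style, every iteration pays for task scheduling and for moving results to and from the central loop. In the worker style there is no central coordinator, but each worker must reach the synchronization point in every iteration, which has consequences for fault tolerance.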

Programming APIs and data abstractions differ considerably between big data and HPC systems. HPC systems have adopted low-level APIs, while big data systems have adopted high-level, user-friendly APIs. Every system has to strike a delicate balance between performance and usability, and achieving performance while preserving usability is a challenge. To explore these points, you’ll look at examples of big data APIs built around HPC systems and of HPC techniques integrated into big data systems.
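As a rough illustration of the API gap, the sketch below writes the same word count twice: once in a low-level style where the programmer manages partitions and the shuffle by hand, and once against a small chained Dataset class. Both are hypothetical, single-process stand-ins written for this example; the method names flat_map, map, reduce_by_key, and collect only loosely echo the style of big data APIs and do not reproduce any real library.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, List


def stable_hash(key: str) -> int:
    """Deterministic hash so that routing does not depend on Python's hash seed."""
    return zlib.crc32(key.encode("utf-8"))


def word_count_low_level(partitions: List[List[str]]) -> Dict[str, int]:
    """Low-level style: the user writes the local work, the shuffle, and the gather."""
    n = len(partitions)
    # Step 1: each worker counts the words in its own partition.
    local_counts = []
    for part in partitions:
        counts: Dict[str, int] = defaultdict(int)
        for line in part:
            for word in line.split():
                counts[word] += 1
        local_counts.append(counts)
    # Step 2: hand-written shuffle that sends each word to its owning worker.
    inboxes: List[Dict[str, int]] = [defaultdict(int) for _ in range(n)]
    for counts in local_counts:
        for word, c in counts.items():
            inboxes[stable_hash(word) % n][word] += c
    # Step 3: gather the per-worker results.
    result: Dict[str, int] = {}
    for inbox in inboxes:
        result.update(inbox)
    return result


class Dataset:
    """High-level style: the same steps hidden behind chainable operators."""

    def __init__(self, partitions: List[list]) -> None:
        self._partitions = partitions

    def flat_map(self, fn: Callable) -> "Dataset":
        return Dataset([[y for x in p for y in fn(x)] for p in self._partitions])

    def map(self, fn: Callable) -> "Dataset":
        return Dataset([[fn(x) for x in p] for p in self._partitions])

    def reduce_by_key(self, op: Callable) -> "Dataset":
        n = len(self._partitions)
        shuffled: List[dict] = [dict() for _ in range(n)]
        for p in self._partitions:
            for k, v in p:
                target = shuffled[stable_hash(k) % n]
                target[k] = op(target[k], v) if k in target else v
        return Dataset([list(d.items()) for d in shuffled])

    def collect(self) -> list:
        return [x for p in self._partitions for x in p]


def word_count_high_level(partitions: List[List[str]]) -> Dict[str, int]:
    pairs = (Dataset(partitions)
             .flat_map(str.split)
             .map(lambda w: (w, 1))
             .reduce_by_key(lambda a, b: a + b)
             .collect())
    return dict(pairs)
```

The low-level version exposes every communication step, which leaves room for tuning but puts the burden on the programmer. The high-level version is shorter to write, and the framework decides how the shuffle is carried out.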

Prerequisite knowledge

General knowledge of big data frameworks such as MapReduce, Spark, Flink, or Storm
Familiarity with HPC systems, OS, and hardware (useful but not required)

What you'll learn

Identify the importance of big data computing and HPC working together to solve larger problems
Explore the differences in performance, usability, and architecture between big data and HPC systems
Discover the opportunities and tools available for building solutions that take advantage of both kinds of systems

Supun Kamburugamuve

Indiana University

Supun Kamburugamuve is a graduate student at Indiana University and a senior software architect at the university's Digital Science Center, where he researches big data applications and frameworks. He’s working on high-performance enhancements to big data systems using HPC interconnects such as InfiniBand and Omni-Path. Supun is an elected member of the Apache Software Foundation and has contributed to many open source projects, including Apache Web Services projects. Previously, he worked on middleware systems and was a key member of the team behind the WSO2 Enterprise Service Bus (ESB), an open source integration product widely used by enterprises. He has a PhD in computer science from Indiana University, specializing in high-performance data analytics.