Bridging the gap between big data computing and high-performance computing
Who is this presentation for?
Data engineers, data scientists
Big data computing and high-performance computing (HPC) have evolved over the years as separate paradigms. With the explosion of data and the demand for machine learning algorithms, these two paradigms are increasingly embracing each other for data management and algorithms. For example, public clouds such as Microsoft Azure are adding high-performance compute instances with InfiniBand, and large-scale deployments of GPUs in HPC clusters enable artificial intelligence algorithms on large data sets. In the future, we can expect more and more applications to explore the benefits of HPC while taking advantage of big data systems. This talk guides the listener through the differences between HPC systems and big data systems and outlines the areas where they can benefit from each other. It also presents the performance differences, usability-based motivations, and architectural differences between systems and frameworks that can guide a user towards picking the right solution.
Understanding the evolution of big data systems and HPC systems helps to define the key differences between them, the goals behind them, and their architectures. Four broad application classes are driving the requirements of data analytics tools and frameworks: data pipelines, large-scale machine learning (including deep learning), streaming applications, and graph applications. Historically, HPC systems have focused less on data management and more on designing high-performance algorithms. Big data systems have done an excellent job in data management, data queries, and streaming applications. Research has shown that machine learning, deep learning, and graph algorithms can benefit immensely from HPC systems.
Parallel operators are one of the key foundational blocks of a distributed computing system. Map-Reduce is one of the best-known parallel operators, and there are many more, such as gather, partition, and scatter. These operators help distribute data among parallel tasks and let the tasks reach consensus as a computation progresses. HPC systems have their own parallel operators with semantics similar to those in big data systems. There are many possible optimizations for these operators, and they are a huge factor in the performance of any distributed application. Advanced hardware such as InfiniBand plays another big role in HPC applications. It provides low-latency, high-throughput networking among a large number of nodes. Such networks are vital for scaling applications to thousands of nodes and tens of thousands of CPU cores.
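To make the operator names concrete, here is a minimal single-process sketch of scatter, gather, partition, and an all-reduce. It only illustrates the semantics: plain Python lists stand in for parallel tasks, and the function names and signatures are hypothetical rather than the API of any particular big data or HPC framework.

```python
from functools import reduce as fold
from typing import Callable, Hashable, List, TypeVar

T = TypeVar("T")


def scatter(data: List[T], num_tasks: int) -> List[List[T]]:
    """Split one list into num_tasks contiguous chunks, one per task."""
    base, extra = divmod(len(data), num_tasks)
    chunks, start = [], 0
    for rank in range(num_tasks):
        # The first `extra` tasks receive one additional item.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def gather(chunks: List[List[T]]) -> List[T]:
    """Collect every task's chunk back into one list, in task order."""
    return [item for chunk in chunks for item in chunk]


def partition(chunks: List[List[T]], num_tasks: int,
              key: Callable[[T], Hashable]) -> List[List[T]]:
    """Re-route items so that equal keys land on the same task (a shuffle)."""
    out: List[List[T]] = [[] for _ in range(num_tasks)]
    for chunk in chunks:
        for item in chunk:
            out[hash(key(item)) % num_tasks].append(item)
    return out


def all_reduce(values: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Combine one value per task and hand the same result to every task."""
    combined = fold(op, values)
    return [combined] * len(values)


if __name__ == "__main__":
    chunks = scatter(list(range(10)), 3)
    print(chunks)
    print(partition(chunks, 2, key=lambda x: x))
    print(all_reduce([sum(c) for c in chunks], lambda a, b: a + b))
```

In a real system each chunk lives on a different node, so every one of these calls implies network traffic, which is why operator implementations and the interconnect matter so much for performance.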
The way iterations are programmed and executed is the other major difference between HPC and big data systems. Iterations are a key component of complex applications and one of the reasons for Spark's success over Hadoop. There are many ways to handle an iteration in different programming models and systems. For example, Spark handles iterations in a central place (the driver), Flink embeds iterations into the dataflow graph, and HPC systems distribute iterations to each worker. Each of these choices has different implications for programming models and performance. The presentation compares these approaches and discusses the performance, usability, and fault tolerance aspects of each choice.
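The contrast between a central loop and worker-local loops can be sketched in a few lines. The toy below is an assumption-laden illustration, not Spark or MPI code: it runs the same gradient-descent estimate of a mean in two styles, once with a single "driver" loop that fans work out on every pass, and once with each worker (a thread here) running the whole loop itself and synchronizing through a hand-written all-reduce. All names (`driver_loop`, `spmd_loop`, `AllReduce`) are hypothetical, and the dataflow-embedded style used by Flink is not shown.

```python
import threading
from typing import List


def driver_loop(partitions: List[List[float]], iterations: int, lr: float) -> float:
    """Central style: one loop; each pass runs a task per partition."""
    n = sum(len(p) for p in partitions)
    x = 0.0
    for _ in range(iterations):
        # Each "task" computes a partial gradient for the current x.
        partials = [sum(x - v for v in part) for part in partitions]
        # The driver combines the partial results and updates the model.
        x -= lr * sum(partials) / n
    return x


class AllReduce:
    """Minimal sum all-reduce for threads, built on a reusable barrier."""

    def __init__(self, num_workers: int) -> None:
        self.slots = [0.0] * num_workers
        self.barrier = threading.Barrier(num_workers)

    def sum(self, rank: int, value: float) -> float:
        self.slots[rank] = value
        self.barrier.wait()  # every worker has written its slot
        total = sum(self.slots)
        self.barrier.wait()  # every worker has read before the next write
        return total


def worker_loop(rank: int, part: List[float], n: int, iterations: int,
                lr: float, comm: AllReduce, results: List[float]) -> None:
    """Worker-local style: the worker owns the loop and its copy of x."""
    x = 0.0
    for _ in range(iterations):
        local = sum(x - v for v in part)
        x -= lr * comm.sum(rank, local) / n
    results[rank] = x


def spmd_loop(partitions: List[List[float]], iterations: int, lr: float) -> List[float]:
    """Start one worker per partition and return each worker's final x."""
    n = sum(len(p) for p in partitions)
    comm = AllReduce(len(partitions))
    results = [0.0] * len(partitions)
    threads = [
        threading.Thread(target=worker_loop,
                         args=(rank, part, n, iterations, lr, comm, results))
        for rank, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    parts = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(driver_loop(parts, 50, 0.5))
    print(spmd_loop(parts, 50, 0.5))
```

Both styles converge to the same value. The difference lies in where the loop state lives: the central style pays a scheduling round trip per iteration but keeps one authoritative copy of the state, while the worker-local style only synchronizes at the all-reduce and leaves every worker holding its own copy.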
Programming APIs and data abstractions are quite different between big data and HPC systems. HPC systems have adopted low-level APIs, while big data systems have adopted high-level, user-friendly APIs. Performance and usability form a delicate balance in any system, and achieving performance while preserving usability is a challenge. To explore these points, the talk gives examples of building big data APIs around HPC systems and of integrating HPC techniques into big data systems.
Prerequisite knowledge
1. Basic knowledge about big data frameworks such as Map-Reduce, Spark, Flink or Storm
2. Knowledge about HPC systems is a plus but not required
3. Knowledge about OS, hardware is a plus
What you'll learn
Supun Kamburugamuve has a PhD in computer science from Indiana University, specializing in high-performance data analytics. He works as a software architect at the Digital Science Center of Indiana University, where he researches big data applications and frameworks. Recently, he has been working on high-performance enhancements to big data systems using HPC interconnects such as InfiniBand and Omni-Path. Supun is an elected member of the Apache Software Foundation and has contributed to many open source projects, including Apache Web Services projects. Before joining Indiana University, Supun worked on middleware systems and was a key member of the WSO2 ESB project, an open source enterprise integration product that is widely used by enterprises.
View a complete list of Strata Data Conference contacts