Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Big data computations: Comparing Apache HAWQ, Druid, and GPU databases

Vijay Agneeswaran (Walmart Labs)
16:35–17:15 Thursday, 25 May 2017
Level: Beginner

Who is this presentation for?

  • Data engineers, architects, and scientists

Prerequisite knowledge

  • A basic understanding of data engineering and analytics

What you'll learn

  • Gain an overview of a media campaign management tool
  • Explore Druid, Apache HAWQ, and Kinetica basic concepts
  • Understand how the three data stores on AWS cloud infrastructure compare


The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. This class is characterized by nonscalar aggregates all the way to the root of the merge tree, equivalent to a set union operation in SQL at every level of the tree. Most big data technologies, however, support only scalar aggregates, so the set union operation must be implemented outside the data store, resulting in nonstandard implementations and consequent inefficiencies.
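To make the distinction between scalar and nonscalar aggregates concrete, the following is a minimal Python sketch, not the implementation discussed in the talk. The `Node` class, the function names, and the sample user IDs are all hypothetical. It contrasts a set union carried up to the root with a scalar sum of counts, which double-counts users who appear in more than one source.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    """A merge-tree node; leaves carry the user IDs seen by one data source."""
    user_ids: FrozenSet[str] = frozenset()
    children: List["Node"] = field(default_factory=list)


def merge_users(node: Node) -> FrozenSet[str]:
    """Nonscalar aggregate: set union of all user IDs in the subtree."""
    result = set(node.user_ids)
    for child in node.children:
        result |= merge_users(child)  # union at every level of the tree
    return frozenset(result)


def sum_of_counts(node: Node) -> int:
    """Scalar aggregate for contrast: adds per-node counts, so shared users are counted twice."""
    return len(node.user_ids) + sum(sum_of_counts(c) for c in node.children)


# Hypothetical data: three sources with overlapping users.
tree = Node(children=[
    Node(children=[
        Node(frozenset({"u1", "u2"})),
        Node(frozenset({"u2", "u3"})),
    ]),
    Node(frozenset({"u3", "u4"})),
])

unique_users = merge_users(tree)   # four distinct users
naive_total = sum_of_counts(tree)  # six, because u2 and u3 are each counted twice
```

The full sets, not just their sizes, have to travel to the root, which is why a store limited to scalar aggregates cannot answer this kind of query on its own.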

Vijay Srinivas Agneeswaran explores a prototype built on top of Druid, one of the claimants to the throne of analytical data processing, to illustrate the problem. Druid supports only scalar aggregates; as a result, the set union operation had to be implemented at the application level. Data transfer into and out of Druid and the complexity of thread processing at the Java layer led to inefficiencies, resulting in a computation time of 200+ seconds.
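The general shape of an application-level union can be sketched as follows. This is an illustration in Python rather than the Java layer described in the abstract, and it does not call any Druid API: `FAKE_STORE` and `fetch_segment` are hypothetical stand-ins for queries that return raw user IDs from the data store.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set

# Hypothetical stand-in for the data store: segment name -> user IDs a query would return.
FAKE_STORE: Dict[str, List[str]] = {
    "site_visits": ["u1", "u2"],
    "ad_clicks": ["u2", "u3"],
    "purchases": ["u3", "u4"],
}


def fetch_segment(segment: str) -> List[str]:
    """Simulates pulling one segment's raw rows out of the store."""
    return list(FAKE_STORE[segment])


def union_in_application(segments: List[str], workers: int = 4) -> Set[str]:
    """Fetches segments on worker threads, then unions the rows in the application."""
    merged: Set[str] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every row crosses the store boundary before any deduplication happens.
        for rows in pool.map(fetch_segment, segments):
            merged.update(rows)
    return merged


audience = union_in_application(list(FAKE_STORE))
```

In this arrangement every raw row leaves the store before it can be deduplicated, which is the kind of data transfer cost the abstract points to.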

With its ability to perform multidimensional partitioning of data, support for full SQL queries (and, consequently, support for set union operations), and its efficient distributed query optimization techniques, Apache HAWQ looked like the ideal candidate for this use case. However, HAWQ’s dependence on Hadoop as the underlying filesystem plus the inherent complexity of the computation led to poorer-than-expected results. HAWQ took about 100 seconds to process the same query, against an SLA of less than 10 seconds.

It turned out that the multidimensional partitioning was inefficient. Vijay explains how this problem was solved through multiple HAWQ clusters and an intelligent client that stores metadata to route queries to the appropriate clusters. Keeping each HAWQ cluster independent reduced the query execution time to 30 seconds.
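One way such a metadata-driven client could look is sketched below in Python. It is an assumption-laden illustration, not the client built for the talk: the `RoutingClient` class, the idea of keying metadata by partition name, the cluster names, and the SQL string are all hypothetical, and each cluster is modeled as a plain function instead of a real HAWQ connection.

```python
from typing import Callable, Dict, List, Set

# A cluster is modeled as a callable that takes a query string and returns user IDs.
Cluster = Callable[[str], Set[str]]


class RoutingClient:
    """Keeps partition-to-cluster metadata and queries only the clusters that hold the data."""

    def __init__(self) -> None:
        self._clusters: Dict[str, Cluster] = {}
        self._partition_to_cluster: Dict[str, str] = {}

    def register(self, name: str, cluster: Cluster, partitions: List[str]) -> None:
        """Records which partitions live on which independent cluster."""
        self._clusters[name] = cluster
        for partition in partitions:
            self._partition_to_cluster[partition] = name

    def clusters_for(self, partitions: List[str]) -> List[str]:
        """Looks up the distinct clusters that own the requested partitions."""
        return sorted({self._partition_to_cluster[p] for p in partitions})

    def query(self, sql: str, partitions: List[str]) -> Set[str]:
        """Runs the query on each relevant cluster and unions the partial results."""
        result: Set[str] = set()
        for name in self.clusters_for(partitions):
            result |= self._clusters[name](sql)
        return result


# Hypothetical setup: two independent clusters, each owning one campaign's data.
client = RoutingClient()
client.register("cluster_1", lambda sql: {"u1", "u2"}, ["campaign_a"])
client.register("cluster_2", lambda sql: {"u2", "u3"}, ["campaign_b"])

targets = client.clusters_for(["campaign_a"])
users = client.query("SELECT user_id FROM events", ["campaign_a", "campaign_b"])
```

Because the clusters share nothing, a query that touches a single partition reaches only one cluster, and the client merges results only when a query spans several.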

Vijay then explores an implementation of the same query with a GPU database (Kinetica) to benchmark its performance on an Amazon g2.8x instance. The response time for the same query was around 12 seconds—and with a bit more optimization, the SLA will be met.

Vijay Agneeswaran

Walmart Labs

Vijay Srinivas Agneeswaran is a senior director of technology at Publicis Sapient. Vijay has spent the last 12 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He’s a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Comments on this page are now closed.


Vijay Agneeswaran
17/05/2017 9:26 BST


Just one small correction in the above abstract. I shall be explaining the performance studies we did on P2.8X (and not G2.8) instances on AWS using Kinetica, one of the few distributed GPU databases available and comparing this with the implementations on top of Druid and Apache HAWQ.