The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. This class is characterized by nonscalar aggregates all the way to the root of the merge tree, equivalent to a set union operation in SQL at every level of the tree. Typical big data technologies mostly support only scalar aggregates, so the set union operation must be implemented outside of the data store, resulting in nonstandard implementations and consequent inefficiencies.
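To make the distinction concrete, the following is a minimal Python sketch of a merge tree whose aggregate at every level is a set union rather than a scalar. It runs in a single process and only illustrates the shape of the computation; the function names, the `fan_in` parameter, and the sample user IDs are hypothetical and are not taken from the talk.

```python
def merge_level(partials, fan_in=2):
    """Union adjacent groups of partial sets to form the next tree level."""
    return [
        set().union(*partials[i:i + fan_in])
        for i in range(0, len(partials), fan_in)
    ]


def merge_tree_union(leaves, fan_in=2):
    """Reduce per-source sets to a single set by unioning level by level.

    Every intermediate node holds a full set (a nonscalar aggregate),
    not a count or a sum, all the way up to the root.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = [set(leaf) for leaf in leaves]
    if not level:
        return set()
    while len(level) > 1:
        level = merge_level(level, fan_in)
    return level[0]


# Hypothetical user IDs seen by four separate data sources (the leaves).
sources = [{"u1", "u2"}, {"u2", "u3"}, {"u4"}, {"u1", "u5"}]
all_users = merge_tree_union(sources)

# A scalar aggregate would keep only a number at each node; the counts of
# the leaves cannot be combined into the count of the union.
naive_count = sum(len(s) for s in sources)
```

Here `naive_count` overcounts users who appear in more than one source, which is why a store that supports only scalar aggregates pushes the union into application code.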
Vijay Srinivas Agneeswaran explores a prototype built on top of Druid, one of the claimants to the throne of analytical data processing, to illustrate the problem. Druid supports only scalar aggregates; as a result, the set union operation had to be implemented at the application level. Data transfer into and out of Druid and the complexity of thread processing at the Java layer led to inefficiencies, resulting in a computation time of 200+ seconds.
With its ability to perform multidimensional partitioning of data, support for full SQL queries (and, consequently, support for set union operations), and its efficient distributed query optimization techniques, Apache HAWQ looked like the ideal candidate for this use case. However, HAWQ’s dependence on Hadoop as the underlying filesystem plus the inherent complexity of the computation led to poorer-than-expected results. HAWQ took about 100 seconds to process the same query, but the SLA required a response in less than 10 seconds.
It turned out that the multidimensional partitioning was inefficient. Vijay explains how this problem was solved through multiple HAWQ clusters and an intelligent client that stores metadata to route queries to appropriate clusters. By ensuring each HAWQ cluster is independent, the time to execute the query was reduced to 30 seconds.
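The abstract does not describe how the intelligent client stores its metadata, so the following Python sketch is only one way such routing could look. It assumes the metadata is a simple mapping from a partition key to the independent cluster that holds it; the class name, method names, partition keys, and cluster names are all hypothetical.

```python
class RoutingClient:
    """Toy client that routes partition requests to independent clusters."""

    def __init__(self):
        # Assumed metadata: partition key -> name of the cluster holding it.
        self.partition_to_cluster = {}

    def register(self, partition_key, cluster):
        """Record which cluster owns a partition."""
        self.partition_to_cluster[partition_key] = cluster

    def route(self, partition_keys):
        """Group the requested partitions by the cluster that stores them."""
        plan = {}
        for key in partition_keys:
            cluster = self.partition_to_cluster.get(key)
            if cluster is None:
                raise KeyError(f"no cluster registered for partition {key!r}")
            plan.setdefault(cluster, []).append(key)
        return plan


client = RoutingClient()
client.register("source_a", "cluster-1")
client.register("source_b", "cluster-1")
client.register("source_c", "cluster-2")

# Each cluster receives only the partitions it owns, so the clusters can
# execute their share of the query independently of one another.
plan = client.route(["source_a", "source_c", "source_b"])
```

In this sketch the per-cluster results would still need to be combined by the client, for example with a set union over the partial results.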
Vijay then explores an implementation of the same query with a GPU database (Kinetica) to benchmark its performance on an Amazon g2.8xlarge instance. The response time for the same query was around 12 seconds, close to the sub-10-second SLA, which should be met with a bit more optimization.
Vijay Srinivas Agneeswaran is a senior director of technology at Publicis Sapient. Vijay has spent the last 12 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.
©2017, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.