The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. This class is characterized by nonscalar aggregates all the way to the root of the merge tree, equivalent to a set union operation in SQL at every level of the tree. Typical big data technologies mostly support only scalar aggregates, so the set union operation must be implemented outside of the data store, resulting in nonstandard implementations and consequent inefficiencies.
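The talk itself presents no code, but the aggregation pattern it describes can be sketched in a few lines of Python. The point of the sketch is the nonscalar nature of the aggregate: each internal node of the tree produces a full set (whose size grows toward the root), not a fixed-size scalar like a SUM or COUNT. The function name and fan-out are illustrative, not from the talk.

```python
from functools import reduce

def set_union_merge_tree(leaves, fanout=2):
    """Aggregate leaf-level sets up a merge tree.

    Unlike a scalar aggregate, each internal node's result is the
    set union of its children, so intermediate results grow in size
    rather than staying fixed, which is what makes this class of
    computation awkward for scalar-only aggregation engines.
    """
    level = [set(leaf) for leaf in leaves]
    while len(level) > 1:
        # Merge groups of `fanout` siblings into their parent node.
        level = [
            reduce(set.union, level[i:i + fanout], set())
            for i in range(0, len(level), fanout)
        ]
    return level[0]

# Example: user IDs observed at four data sources
sources = [{1, 2}, {2, 3}, {3, 4}, {4, 5}]
set_union_merge_tree(sources)  # {1, 2, 3, 4, 5}
```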
Vijay Srinivas Agneeswaran explores a prototype built on top of Druid, one of the claimants to the throne of analytical data processing, to illustrate the problem. Because Druid supports only scalar aggregates, the set union operation had to be implemented at the application level. Data transfer into and out of Druid, together with the complexity of thread processing at the Java layer, led to inefficiencies and a computation time of more than 200 seconds.
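The prototype's Java code is not shown in the talk, but the application-level pattern can be sketched as follows. Here `query_shard` is a hypothetical stand-in for the real per-shard Druid queries; the sketch only illustrates why this approach is costly, since every row must leave the store and be unioned client-side.

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard):
    # Hypothetical stand-in for a per-shard query against the data
    # store; the actual prototype issued Druid queries from Java.
    return set(shard)

def app_level_union(shards):
    """Union per-shard results at the application layer.

    Because the store exposes only scalar aggregates, all rows must
    cross the wire to the client before they can be deduplicated,
    which is where the data-transfer and thread-handling overhead
    described above comes from.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(query_shard, shards)
    merged = set()
    for partial in results:
        merged |= partial
    return merged
```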
With its ability to perform multidimensional partitioning of data, its support for full SQL queries (and, consequently, for set union operations), and its efficient distributed query optimization techniques, Apache HAWQ looked like the ideal candidate for this use case. However, HAWQ's dependence on Hadoop as the underlying filesystem, combined with the inherent complexity of the computation, led to poorer than expected results: HAWQ took about 100 seconds to process the same query, while the SLA required under 10 seconds.
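For contrast with the application-level approach, a full-SQL engine lets the set union be expressed and executed inside the database. A minimal illustration, using SQLite purely as a stand-in for HAWQ and with made-up table names:

```python
import sqlite3

# SQLite stands in here for a full-SQL engine such as HAWQ: the UNION
# is planned and executed inside the database, not at the application
# layer, so only the deduplicated result reaches the client.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_a (user_id INTEGER);
    CREATE TABLE source_b (user_id INTEGER);
    INSERT INTO source_a VALUES (1), (2), (3);
    INSERT INTO source_b VALUES (3), (4);
""")
rows = conn.execute(
    "SELECT user_id FROM source_a UNION SELECT user_id FROM source_b"
).fetchall()
# UNION deduplicates, so the overlapping user_id 3 appears only once
```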
It turned out that the multidimensional partitioning was inefficient. Vijay explains how this problem was solved with multiple HAWQ clusters and an intelligent client that stores metadata to route queries to the appropriate cluster. By making each HAWQ cluster independent, the team reduced query execution time to 30 seconds.
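The talk does not show the intelligent client itself; a minimal sketch of the routing idea, with a hypothetical class name and partition scheme, might look like this. The essential property is that the client's metadata maps each partition of the data to the one independent cluster that owns it, so a query touches exactly one cluster and clusters never coordinate with each other.

```python
# Hypothetical sketch of a metadata-based routing client; names and the
# partition-key scheme are illustrative, not from the talk.
class RoutingClient:
    def __init__(self, cluster_map):
        # cluster_map: partition key -> endpoint of the owning cluster
        self.cluster_map = cluster_map

    def route(self, partition_key):
        # Look up which independent cluster owns this partition.
        if partition_key not in self.cluster_map:
            raise KeyError(f"no cluster owns partition {partition_key!r}")
        return self.cluster_map[partition_key]

client = RoutingClient({"us-east": "hawq-1:5432", "eu-west": "hawq-2:5432"})
endpoint = client.route("eu-west")  # "hawq-2:5432"
```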
Vijay then explores an implementation of the same query on a GPU database (Kinetica), benchmarked on an Amazon g2.8xlarge instance. The response time for the same query was around 12 seconds, and with a bit more optimization, the SLA can be met.
Dr. Vijay Srinivas Agneeswaran holds a bachelor's degree in computer science and engineering from SVCE, Madras University (1998), an MS (by research) from IIT Madras (2001), and a PhD from IIT Madras (2008), and completed a postdoctoral research fellowship in the LSIR Labs at the Swiss Federal Institute of Technology, Lausanne (EPFL). He currently heads data sciences R&D at Walmart Labs, India, where he leads the machine learning platform development and data science foundation teams, which provide platform and intelligent services for Walmart businesses across the world. He has spent the last eighteen years creating intellectual property and building data-based products in industry and academia. Previously, he led the team that delivered real-time hyperpersonalization for a global automaker, as well as work for clients across domains such as retail, banking and finance, telecom, and automotive. He has built PMML support into Spark and Storm and implemented several machine learning algorithms, such as LDA and random forests, on Spark; led a team that designed and implemented a big data governance product providing role-based, fine-grained access control inside Hadoop YARN; and, with his team, built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and a senior member of the IEEE for the last 10+ years, holds five full US patents, and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, artificial intelligence, big data, and other emerging technologies.
©2017, O’Reilly UK Ltd