Handling heterogeneous and concurrent query workloads in a multiuser environment is a common scenario for BI analytics over SQL-on-Hadoop systems. Properly deploying a SQL-on-Hadoop cluster that provides the best performance in such an environment requires extensive knowledge of the workloads, overall resource utilization, database table design, software stack configurations, and hardware settings. An improperly planned deployment can lead either to an underutilized cluster that wastes company assets or to one that fails to meet performance requirements. In one real-world example, a company deployed an 80-node cluster; however, its workloads and data volume required less than half of the nodes to meet its performance requirements. This means more than half of the nodes sat doing nothing but waiting to be depreciated. In another, a company used more expensive SSDs even though, at the software level, operations were single threaded and bottlenecked by CPU rather than I/O, so HDDs might have been a better choice for deployment. Of course, in some scenarios, SSDs might improve the overall query execution time by more than 50%, so there is really no one-size-fits-all solution.
As these examples suggest, many challenges exist in designing a SQL-on-Hadoop cluster for production in a multiuser environment with heterogeneous and concurrent query workloads. Jun Liu and Zhaojuan Bian draw on their personal experience to address these challenges, explaining how to use a simulation-based approach to determine the right size of your cluster for different combinations of hardware and software resources.
Jun Liu is a senior performance engineer in Intel’s Software and Service Group, where he works in the area of big data performance modeling and simulation, especially SQL-on-Hadoop systems. Before Intel, Jun was a postdoctoral researcher and senior member of the Database Performance and Migration group (DPMG) at Dublin City University. His primary research areas are data migration and database performance optimization. Jun also worked as a software engineer at Ericsson and has participated in the development of different projects in the areas of real-time complex event processing and big data analysis. Jun holds a PhD in computing from Dublin City University, an MSc in advanced software engineering from University College Dublin, and a BSc in computer science from Dublin Institute of Technology.
Zhaojuan (Bianny) Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of industry experience in performance analysis, spanning big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.
©2016, O'Reilly Media, Inc.