Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Planning your SQL-on-Hadoop cluster for a multiuser environment with heterogeneous and concurrent query workloads

Jun Liu (Intel), Zhaojuan Bian (Intel)
1:15pm–1:55pm Wednesday, 09/28/2016
Hadoop use cases
Location: 3D 08
Level: Beginner
Average rating: 2.00 (2 ratings)

Prerequisite knowledge

  • A general understanding of SQL-on-Hadoop systems
  • Familiarity with deployment planning and why organizations do it for their big data systems (useful but not required)

What you'll learn

  • Understand common big data deployment issues
  • Learn how to use a cluster-level big data simulation tool to help successfully deploy your big data system for production in a minimum amount of time

Description

    Handling heterogeneous and concurrent query workloads in a multiuser environment is a common usage scenario for BI analytics on SQL-on-Hadoop systems. Deploying a SQL-on-Hadoop cluster that provides the best performance in such an environment requires extensive knowledge of the workloads, overall resource utilization, database table design, software stack configurations, and hardware settings. An improperly planned deployment can either leave the cluster underutilized, wasting company assets, or fail to meet performance requirements. In one real-world example, a company deployed an 80-node cluster even though its workloads and data volume needed fewer than half of those nodes to meet the performance requirement, so more than half of the nodes sat idle, doing nothing but waiting to be depreciated. In another, a company used more expensive SSDs even though, at the software level, operations were single threaded and bottlenecked by CPU rather than I/O, so HDDs might have been the better choice. Of course, in other scenarios SSDs might improve overall query execution time by more than 50%, so there is no one-size-fits-all solution.

    As these examples suggest, designing a SQL-on-Hadoop cluster for production in a multiuser environment with heterogeneous and concurrent query workloads poses many challenges. Jun Liu and Zhaojuan Bian draw on their personal experience to address these challenges, explaining how to use a simulation-based approach to determine the right size for your cluster across different combinations of hardware and software resources.

    Topics include:

    • An introduction to Impala and its resource management mechanism
    • Deployment challenges of an Impala cluster: Selecting the best combination of software configurations and choosing hardware settings and sizing
    • A case study: Planning your Impala system in a multiuser environment

    Jun Liu


    Jun Liu is a senior performance engineer in Intel’s Software and Service Group, where he works on big data performance modeling and simulation, especially for SQL-on-Hadoop systems. Before Intel, Jun was a postdoctoral researcher and senior member of the Database Performance and Migration group (DPMG) at Dublin City University, where his primary research focus was data migration and database performance optimization. Jun also worked as a software engineer at Ericsson and has contributed to projects in real-time complex event processing and big data analysis. Jun holds a PhD in computing from Dublin City University, an MSc in advanced software engineering from University College Dublin, and a BSc in computer science from Dublin Institute of Technology.


    Zhaojuan Bian


    Zhaojuan (Bianny) Bian is an engineering manager in Intel’s Software and Service Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of industry experience, with performance analysis work that spans big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.

    Comments on this page are now closed.


    Alex Rivlin
    10/02/2016 8:33pm EDT

    Can you please share the link to the slides?
    Thank you