Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Designing an SQL-on-Hadoop cluster using Impala simulator: A use case for the banking and financial services sector

Jun Liu (Intel), Zhaojuan Bian (Intel)
11:50am–12:30pm Wednesday, 12/02/2015
Hadoop Platform
Location: 334-335 Level: Intermediate

Prerequisite Knowledge

The audience should have some basic knowledge of SQL-on-Hadoop systems, i.e., how to query a database using a SQL-on-Hadoop engine.

Description

According to our previous performance analysis, the performance of an Impala cluster varies significantly with workload setup, hardware selection, cluster topology, and software parameters. For example, extending the system with more nodes or extra disks accelerates most TPC-DS queries. However, a four-node cluster can be 20 percent faster than a five-node cluster for some queries when data exchange between nodes is expensive, and when performance is bottlenecked by the processors, increasing the disk count can even slow queries down by 16 percent. Working out all of these interactions experimentally would require a massive number of tests; as we will demonstrate, a simulation-based approach provides a more flexible way to explore such scenarios.
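
To make the tradeoff concrete, here is a minimal, hypothetical cost model, written for this summary and not part of the simulator presented in the talk, in which a query's runtime is the sum of a parallel scan phase and an inter-node data-exchange phase; all bandwidth figures, data volumes, and exchange fractions below are illustrative assumptions.

  # Hypothetical illustration only -- not the Intel Impala simulator.
  def query_time_s(nodes, scan_gb, exchange_frac, disk_mbps=500.0, net_mbps=120.0):
      """Estimate query time as local scan time plus data-exchange time (all constants assumed)."""
      scan_mb = scan_gb * 1024.0
      scan_s = scan_mb / (nodes * disk_mbps)                # scans parallelize across nodes
      # Each node ships roughly (nodes - 1)/nodes of its exchanged data to peers,
      # so exchange time creeps up as the cluster grows.
      exchange_s = scan_mb * exchange_frac * (nodes - 1) / nodes / net_mbps
      return scan_s + exchange_s

  for frac, label in ((0.05, "scan-heavy query"), (0.40, "exchange-heavy query")):
      times = {n: round(query_time_s(n, scan_gb=100, exchange_frac=frac), 1) for n in (4, 5)}
      print(label, times)

With a small exchange fraction the fifth node pays off, while with a large one it does not, which is exactly the kind of interaction that is expensive to discover by building and benchmarking both clusters.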

In this talk, we present an Impala simulator that can be used for capacity planning, optimization, and scaling analysis. The simulator models the behavior of the complete software stack and simulates the activities of cluster components, including storage, network, and processors. It provides a flexible way of evaluating different data schemas (e.g., table/field design and data partitioning), software configurations (e.g., HDFS block size, file format, and compression type), and hardware setups (e.g., CPU type and frequency, storage device type and speed, and cluster size).
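
The simulator's own interfaces are not described here; purely as a hypothetical sketch (the class, function, and knob names and the cost constants below are assumptions of this summary, not the tool's API), a simulation-driven sweep over those knobs might be organized roughly as follows.

  # Hypothetical illustration only -- names and constants are assumed, not the tool's API.
  from dataclasses import dataclass
  from itertools import product

  @dataclass(frozen=True)
  class ClusterConfig:
      nodes: int
      disks_per_node: int
      hdfs_block_mb: int
      file_format: str      # e.g. "parquet" or "text"
      compression: str      # e.g. "snappy" or "none"

  def predicted_query_time(cfg: ClusterConfig, scan_gb: float = 100.0) -> float:
      """Toy stand-in for a simulator: scan + CPU decode + exchange + scheduling costs."""
      disk_mbps = cfg.disks_per_node * 120.0                      # aggregate disk bandwidth per node
      fmt_ratio = 0.3 if cfg.file_format == "parquet" else 1.0    # columnar format scans less data
      comp_ratio = 0.5 if cfg.compression == "snappy" else 1.0    # compressed data is smaller on disk
      scan_mb = scan_gb * 1024.0 * fmt_ratio * comp_ratio
      scan_s = scan_mb / (cfg.nodes * disk_mbps)
      decode_s = scan_mb / (cfg.nodes * 800.0)                    # CPU cost to decode/decompress
      exchange_s = scan_mb * 0.2 * (cfg.nodes - 1) / cfg.nodes / 120.0
      schedule_s = (scan_mb / cfg.hdfs_block_mb) * 0.002 / cfg.nodes   # per-block scheduling overhead
      return scan_s + decode_s + exchange_s + schedule_s

  candidates = [ClusterConfig(n, d, b, f, c)
                for n, d, b, f, c in product((4, 6, 8), (4, 8), (128, 256),
                                             ("parquet", "text"), ("snappy", "none"))]
  best = min(candidates, key=predicted_query_time)
  print(best, round(predicted_query_time(best), 1), "seconds (predicted)")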

We will then walk through a real-world example of using the simulator to design an Impala cluster for the banking and financial services sector.

Topics in this session include:

  • How to build simulation models for SQL-on-Hadoop systems
  • How to plan an Impala cluster through a simulation-based approach
  • Table schema, data placement and partitioning, and file format and compression selection
  • Cluster sizing and hardware selection
  • Software parameter tuning
  • Case study: planning, tuning, and optimizing a banking use case

Jun Liu

Intel

Jun Liu is a senior performance engineer in Intel’s Software and Services Group, where he works on big data performance modeling and simulation, especially for SQL-on-Hadoop systems. Before joining Intel, Jun was a postdoctoral researcher and senior member of the Database Performance and Migration Group (DPMG) at Dublin City University, where his primary research focus was data migration and database performance optimization. Jun also worked as a software engineer at Ericsson, where he participated in the development of projects in real-time complex event processing and big data analysis. Jun holds a PhD in computing from Dublin City University, an MSc in advanced software engineering from University College Dublin, and a BSc in computer science from Dublin Institute of Technology.

Zhaojuan Bian

Intel

Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Services Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of industry experience in performance analysis spanning big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.