According to our previous performance analysis study, the performance of an Impala cluster varies significantly with workload setup, hardware selection, cluster topology, and software parameters. For example, adding nodes or extra disks can accelerate the execution of most TPC-DS queries. However, a four-node cluster can be 20 percent faster than a five-node cluster for some queries when data exchange between nodes is expensive. Furthermore, when performance is bottlenecked by processors, increasing the disk count can even slow queries by 16 percent. Working out all of these effects is not trivial without running a massive number of tests. As we will demonstrate, a simulation-based approach provides a more flexible way to test such scenarios.
In this talk, we present an Impala simulator that can be used for capacity planning, optimization, and scaling analysis. The simulator models the behavior of a complete software stack and simulates the activities of cluster components, including storage, network, and processors. It provides a flexible way of evaluating different data schemas (e.g. table/fields design and data partitioning), software configurations (e.g. HDFS block size, file format, and compression types), and hardware setups (e.g. CPU type and frequency, storage device types and speed, and cluster size).
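To make the bottleneck reasoning concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the simulator presented in the talk: the `ClusterConfig` fields, the `estimate_query_seconds` function, and every number in it are hypothetical, and the model simply assumes that scanning is limited by the slower of disk and CPU, with data exchange added on top. It shows only why extra disks stop helping once a query becomes CPU-bound; it does not reproduce the slowdowns reported in our study, which depend on effects a toy formula cannot capture.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster description; all fields are illustrative assumptions."""
    nodes: int
    disks_per_node: int
    disk_mb_per_s: float     # sequential read throughput of one disk
    cpu_mb_per_s: float      # scan/decode throughput of one node's processors
    network_mb_per_s: float  # network bandwidth available to one node


def estimate_query_seconds(cfg: ClusterConfig, scan_mb: float,
                           shuffle_fraction: float) -> float:
    """Toy estimate: the scan is bound by the slower of disk and CPU,
    then a fraction of the scanned data is exchanged between nodes."""
    per_node_mb = scan_mb / cfg.nodes
    disk_s = per_node_mb / (cfg.disks_per_node * cfg.disk_mb_per_s)
    cpu_s = per_node_mb / cfg.cpu_mb_per_s
    # Assume each node sends the non-local share of its shuffled data.
    remote_share = (cfg.nodes - 1) / cfg.nodes
    net_s = per_node_mb * shuffle_fraction * remote_share / cfg.network_mb_per_s
    return max(disk_s, cpu_s) + net_s


if __name__ == "__main__":
    base = ClusterConfig(nodes=4, disks_per_node=2, disk_mb_per_s=100.0,
                         cpu_mb_per_s=400.0, network_mb_per_s=1000.0)
    for disks in (2, 4, 8):
        cfg = replace(base, disks_per_node=disks)
        seconds = estimate_query_seconds(cfg, scan_mb=80_000.0,
                                         shuffle_fraction=0.0)
        print(f"{disks} disks per node: {seconds:.1f} s")
```

In this sketch, going from two to four disks per node halves the estimate, while going from four to eight changes nothing because the processors have become the limit. A full simulator replaces the single `max` with detailed models of storage, network, and processor activity.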
We will walk through a real-world example of using an Impala simulator to design an Impala cluster in the banking and financial services sector.
Topics in this session include:
Jun Liu is a senior performance engineer in Intel’s Software and Services Group, where he works on big data performance modeling and simulation, especially for SQL-on-Hadoop systems. Before Intel, Jun was a postdoctoral researcher and senior member of the Database Performance and Migration group (DPMG) at Dublin City University. His research focuses primarily on data migration and database performance optimization. Jun also worked as a software engineer at Ericsson, where he participated in the development of projects in real-time complex event processing and big data analysis. Jun holds a PhD in computing from Dublin City University, an MSc in advanced software engineering from University College Dublin, and a BSc in computer science from Dublin Institute of Technology.
Zhaojuan Bianny Bian is an engineering manager in Intel’s Software and Services Group, where she focuses on big data cluster modeling to provide services in cluster deployment, hardware projection, and software optimization. Bianny has more than 10 years of industry experience in performance analysis, spanning big data, cloud computing, and traditional enterprise applications. She holds a master’s degree in computer science from Nanjing University in China.