Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Tutorials & Conference
Beijing, China

Optimizing Alluxio cache strategy and large-scale performance evaluation

This will be presented in Chinese

Rong Gu (Nanjing University)
14:50–15:30 Saturday, 2017-07-15
Spark & beyond
Location: Function Room 2
Level: Intermediate

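Prerequisite Knowledge

A basic understanding of distributed storage systems such as HDFS and Alluxio.

What you'll learn

Understand quantitatively how different caching strategies affect the performance of upper-layer big data applications; learn how to evaluate the performance of a distributed file system quantitatively.

Description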

Alluxio (formerly known as Tachyon) is an open source, memory-centric, virtual distributed storage system. In the big data ecosystem, Alluxio sits between computing frameworks such as Apache Spark, Apache MapReduce, and Apache Flink and existing storage systems such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, and OSS. Alluxio has brought significant performance improvements to big data software stacks. For example, Baidu used Alluxio to increase the throughput of its data analysis pipeline by a factor of 30, and Barclays used it to cut the running time of its analysis jobs from hours to seconds. Beyond performance, Alluxio bridges new big data applications and the data held in traditional storage systems, and it supports deployment in a variety of environments.

Alluxio provides a tiered storage mechanism that uniformly manages the storage resources in a cluster (memory, SSD, HDD, etc.), allowing larger datasets to be stored on Alluxio. To keep hot data in the faster tiers as much as possible, Alluxio implements several advanced cache replacement strategies beyond LRU, including LIRS, LRFU, and ARC, designed for a variety of big data application scenarios. LIRS suits iterative big data applications; LRFU strikes a balance between the classic LRU and LFU strategies; and ARC adapts to changes in the data access patterns of the upper-level application. Rong Gu details the features of these advanced cache algorithms and explains what to watch for when using them with different big data applications.
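To make the LRU/LFU balance concrete, here is a minimal Python sketch of the LRFU idea as it is usually described: each block carries a combined recency and frequency (CRF) score that decays with age, and the block with the lowest current score is evicted. This is an illustration only, not Alluxio's evictor code; the class name `LRFUCache`, the helper `run`, the decay parameter `lam`, and the weighting function 0.5 ** (lam * age) are all assumptions made for the example.

```python
class LRFUCache:
    """Toy LRFU cache over block IDs (hypothetical; not Alluxio's evictor API)."""

    def __init__(self, capacity, lam):
        self.capacity = capacity
        self.lam = lam      # 0.0 behaves like LFU, 1.0 behaves like LRU
        self.clock = 0      # logical time, advanced once per access
        self.crf = {}       # block ID -> CRF score as of its last access
        self.last = {}      # block ID -> logical time of its last access

    def _weight(self, age):
        # Older references contribute less; with lam == 0 nothing decays.
        return 0.5 ** (self.lam * age)

    def _current_crf(self, block):
        # Stored score decayed to the current logical time.
        return self.crf[block] * self._weight(self.clock - self.last[block])

    def access(self, block):
        """Touch a block; return the evicted block ID, or None."""
        self.clock += 1
        evicted = None
        if block in self.crf:
            # Hit: one new reference plus the decayed history.
            self.crf[block] = 1.0 + self._current_crf(block)
        else:
            if len(self.crf) >= self.capacity:
                # Miss on a full cache: evict the lowest current score.
                evicted = min(self.crf, key=self._current_crf)
                del self.crf[evicted], self.last[evicted]
            self.crf[block] = 1.0
        self.last[block] = self.clock
        return evicted


def run(trace, capacity, lam):
    """Replay a trace of block IDs and return the eviction made at each step."""
    cache = LRFUCache(capacity, lam)
    return [cache.access(block) for block in trace]
```

Replaying the trace a, a, a, b, c against a two-block cache shows the two extremes: with `lam=0.0` the final access evicts `b` (the least frequently used block), while with `lam=1.0` it evicts `a` (the least recently used block). Values in between trade the two signals off.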

Alluxio-Perf is a tool for large-scale performance evaluation and tuning of Alluxio and its applications. Alluxio-Perf runs in a distributed master-slave architecture and supports multinode, multiprocess, and multithread concurrent execution modes. It comes with a series of typical benchmarks for Alluxio operations, including sequential reads and writes, random reads, metadata operations, iterative reads and writes, and mixed-mode reads and writes. In addition, users can easily add their own test cases to Alluxio-Perf to tune performance for their own application scenarios. Rong shares big data application test cases (machine learning, query) custom-built in Alluxio-Perf with different caching strategies and compares how each strategy performs in different large-scale read and write scenarios. The conclusions about which scenarios suit each strategy are validated with real Spark SQL and Spark MLlib applications.
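The shape of a multithreaded sequential-read versus random-read benchmark can be sketched in a few lines of Python. This is a single-machine illustration against a local temporary file, not Alluxio-Perf itself and not its test-case API; the function names, the 64 KiB block size, and the thread counts are assumptions chosen for the example.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64 * 1024  # bytes per read; an arbitrary choice for this sketch


def read_blocks(path, offsets):
    """Read one BLOCK at each offset and return the total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total


def run_benchmark(path, threads, random_access, seed=0):
    """Split the file's blocks across threads; return (bytes_read, seconds)."""
    n_blocks = os.path.getsize(path) // BLOCK
    offsets = [i * BLOCK for i in range(n_blocks)]
    if random_access:
        # Random-read mode: visit the same blocks in a shuffled order.
        random.Random(seed).shuffle(offsets)
    chunk = max(1, -(-n_blocks // threads))  # ceiling division
    shards = [offsets[i * chunk:(i + 1) * chunk] for i in range(threads)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        totals = list(pool.map(lambda shard: read_blocks(path, shard), shards))
    return sum(totals), time.perf_counter() - start


def make_test_file(n_blocks):
    """Create a temporary file holding n_blocks * BLOCK random bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(n_blocks * BLOCK))
    return path
```

Dividing the returned byte count by the elapsed seconds gives a throughput figure for each mode. A real evaluation would run many such workers on many nodes and against the storage system under test rather than a local file, so absolute numbers from this sketch say nothing about Alluxio.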

Rong Gu

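Nanjing University

Rong Gu received his PhD from the Department of Computer Science at Nanjing University, where he now works. He is a PMC member and maintainer of the open source big data storage project Alluxio and an Apache Spark contributor. As a well-known developer in the Alluxio community, he has done much of the work on feature stability and performance improvement in Alluxio, including the performance testing framework Alluxio-Perf, the integration of Alluxio with several components of the Hadoop ecosystem, and the community's Chinese documentation. On the Spark side, he designed and implemented the feature, released in Spark 1.0, that supports storing RDDs in Alluxio. He has published or had accepted more than ten papers (10 as first author) and contributed chapters to books including 《深入理解大数据—卷1: 大数据处理与编程实践》 (Understanding Big Data, Volume 1: Big Data Processing and Programming Practice) and 《实战Hadoop:开启通向云计算的捷径》 (Hadoop in Action: A Shortcut to Cloud Computing). An enthusiastic sharer of technical knowledge, he organizes the Nanjing Big Data Technology Meetup (seven events held so far) and has spoken many times at well-known technical conferences in China, such as the China Database Technology Conference. He has also held big data systems R&D internships at Microsoft Research, Intel, Baidu, and Transwarp (星环科技).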
