Presented by O'Reilly and Cloudera
Make Data Work

Best practices for using Alluxio with Spark

This session will be presented in Chinese

Yupeng Fu (Alluxio)
14:00–14:40 Saturday, 2017-07-15
Spark & beyond
Location: Grand Hall B (紫金大厅B). Level: Intermediate

Prerequisite knowledge

Basic knowledge of Spark and big data architecture

What you'll learn

Learn how to use Alluxio with Spark effectively to reduce job completion time

Description

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that uses memory to store data and to accelerate access to data held in different storage systems. Many organizations run Alluxio with Apache Spark, and some of these deployments have scaled beyond a petabyte of data.

Alluxio can make Spark more effective in both on-premises and public cloud deployments. It bridges Spark applications with various storage systems and further accelerates data-intensive applications. It also provides a unified namespace over data in those storage systems, which is convenient for application developers. In addition, Alluxio keeps hot data in memory so that applications can access important data quickly. Although Spark has its own in-memory cache, Alluxio’s in-memory storage can further improve Spark applications.

Yupeng Fu explains how Alluxio helps Spark be more effective and shares examples of production deployments of Alluxio and Spark working together. Yupeng also discusses best practices for using Alluxio with Spark, covering RDDs and DataFrames, in both on-premises and public cloud deployments.
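
As a rough illustration of the pattern described above, the sketch below writes a Spark DataFrame through Alluxio and reads it back, so that later jobs can read the data from Alluxio rather than from the original store. It is only a minimal sketch and is not taken from the session: the host name, the paths, and the helper names `alluxio_uri` and `stage_in_alluxio` are assumptions made for illustration. The port 19998 is the default Alluxio master port, and the sketch assumes the Alluxio client jar is already on the Spark classpath and that `spark` is an existing `SparkSession`.

```python
# Assumed values for illustration only; 19998 is Alluxio's default master port.
ALLUXIO_MASTER_HOST = "localhost"
ALLUXIO_MASTER_PORT = 19998


def alluxio_uri(path, host=ALLUXIO_MASTER_HOST, port=ALLUXIO_MASTER_PORT):
    """Build an alluxio:// URI for a path in the Alluxio namespace (hypothetical helper)."""
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def stage_in_alluxio(spark, source_path, alluxio_path):
    """Copy a Parquet dataset into Alluxio and return a DataFrame that reads from it.

    `spark` is an existing SparkSession; `source_path` is any path Spark can
    already read (for example an HDFS or S3 location). Hypothetical helper.
    """
    target = alluxio_uri(alluxio_path)
    # Read once from the original store and write a copy through Alluxio.
    spark.read.parquet(source_path).write.mode("overwrite").parquet(target)
    # Later reads go through Alluxio instead of the original store.
    return spark.read.parquet(target)
```

A job might call `stage_in_alluxio(spark, "s3a://some-bucket/events", "/staged/events")`, where both paths are placeholders. Because the data is addressed only by its `alluxio://` URI, the same code works regardless of which storage system sits underneath.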

Yupeng Fu


Yupeng Fu is a software engineer at Alluxio and a PMC member of the Alluxio open source project. Previously, Yupeng worked at Palantir, where he led the efforts to build the company’s storage solution. Yupeng holds a BS and an MS from Tsinghua University and has completed coursework toward a PhD at UCSD.