Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Fast big data analytics with Spark on Tachyon in Baidu

Bin Fan (Alluxio), Xiang Wen (Baidu)
11:00am–11:40am Wednesday, 12/02/2015
Hadoop & Beyond
Location: 328-329 Level: Intermediate
Average rating: ***..
(3.22, 9 ratings)
Slides:   1-PPTX 

Prerequisite Knowledge

None, though familiarity with Tachyon and Spark is helpful.


In this talk, we focus on how Tachyon helps improve the efficiency of big data analytics (ad hoc queries) within Baidu.

Baidu currently runs a production Tachyon cluster of 100 nodes with over 2PB of storage space. This cluster mainly serves as the cache layer for our big data analytics engine. In this talk, we first introduce the big data analytics infrastructure within Baidu. We then explain why we started using Tachyon a few months ago, along with the problems we encountered when we started using it. Next, we delve into the details of how Tachyon currently helps accelerate our big data analytics pipeline. Finally, we discuss the new features we would like to see and our plan to scale further.

Bin Fan


Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google, building next-generation storage infrastructure, where he won Google’s technical infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.

Xiang Wen


Xiang Wen currently works in the Baidu Big Data Group, focusing on big data infrastructure.

Comments on this page are now closed.


Cuong Nguyen
12/02/2015 9:55pm +08

Hi Bin Fan, could you please send me the slides for the session? Thank you.

Haoyuan Li
08/31/2015 12:51pm +08

Tian Lang,

Thanks for the interest. This sounds great. In the meantime, please feel free to email me or post the question to the Tachyon user mailing list, where you should get a faster response.

Tian Lang
08/26/2015 5:34am +08

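When using Streaming to save data to HDFS, we found that every DataFrame.insertToTable operation produces multiple small files. We would like a component that can insert data in real time, automatically sync the data to HDFS after a certain period, and let us query both the in-memory data and the data in HDFS directly with HQL. Is Tachyon a good fit for this?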