Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Fast big data analytics with Spark on Tachyon in Baidu

Bin Fan (Alluxio), Xiang Wen (Baidu)
11:00am–11:40am Wednesday, 12/02/2015
Hadoop & Beyond
Location: 328-329
Level: Intermediate
Average rating: 3.22 (9 ratings)
Slides:   1-PPTX 

Prerequisite Knowledge

None, though familiarity with Tachyon and Spark is helpful.

Description

This talk focuses on how Tachyon improves the efficiency of big data analytics (ad hoc queries) within Baidu.

Baidu currently runs a production Tachyon cluster with 100 nodes and over 2PB of storage space; this cluster mainly serves as the cache layer for our big data analytics engine. We first introduce the big data analytics infrastructure within Baidu. We then explain why we started using Tachyon a few months ago and the problems we encountered along the way. Next, we delve into the details of how Tachyon helps accelerate our big data analytics pipeline in its current state. Finally, we discuss the new features we want to see and our plan to scale further.

Bin Fan

Alluxio

Bin Fan is a software engineer at Alluxio. Bin is one of the top committers on the Alluxio project. Prior to Alluxio, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. Bin has a PhD in computer science from Carnegie Mellon University.

Xiang Wen

Baidu

Xiang Wen currently works in the Baidu Big Data Group, focusing on big data infrastructure.

Comments on this page are now closed.

Comments

Cuong Nguyen
12/02/2015 1:55pm SGT

Hi Bin Fan, could you please send me the slides for the session? Thank you.

Haoyuan Li
08/31/2015 4:51am SGT

Tian Lang,

Thanks for the interest. This sounds great. In the meantime, please feel free to email me or post the question to the Tachyon user mailing list, where you should get a faster response.

Tian Lang
08/25/2015 9:34pm SGT

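When we use Streaming to save data to HDFS, we find that every DataFrame.insertToTable operation produces multiple small files. We want a component that can insert data in real time, automatically sync the data to HDFS after a certain period, and let us query both the in-memory data and the data in HDFS directly with HQL. Is Tachyon a good fit?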