Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Introducing Hive's new execution engine - Spark

Xuefu Zhang (Cloudera), Chengxiang Li (Intel)
11:30am–12:10pm Thursday, 02/19/2015
Spark in Action
Location: 210 C/G
Average rating: 2.50 (10 ratings)

Apache Hive has become the de facto standard for SQL on big data in the Hadoop ecosystem. Thanks to Hive's open architecture and backend neutrality, Hive queries can run on MapReduce and Tez. Meanwhile, Apache Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Marrying the two, that is, adding Spark as a new execution engine for Hive, has many benefits for both Spark users and Hive users. This presentation covers the motivation, design principles, and architecture, followed by a demo.

Xuefu Zhang

Cloudera

Xuefu Zhang has over 10 years of experience in software development. He has worked at Cloudera since May 2013, where he spends much of his effort on Apache Hive and Pig. He previously worked on the Hadoop team at Yahoo when the majority of Hadoop development was still done there. He is currently a PMC member for Hive and Sentry and a committer for the Pig project.

Chengxiang Li

Intel

Chengxiang Li is a software engineer on the Intel SSG Big Data Technology team. He is dedicated to enabling and improving SQL interfaces in the Hadoop ecosystem and to optimizing SQL engine performance with IA technologies. Before joining Intel, he worked as a software engineer at several companies, participating in projects that included an MPP-like SQL engine and a distributed index server.

Comments

Xuefu Zhang
02/19/2015 7:47am PST

The second run is faster not because of data caching but because it skips the cluster startup cost. There is no stale-data issue in this case.

Arthur Yeo
02/19/2015 6:55am PST

When a query is run a second time, the turnaround time is quite impressive.
Now, if the data in the related tables is refreshed, would it know that the data has changed and refresh it from HDFS?
In other words, will it know that the cached data is stale?