Presented by O'Reilly and Cloudera
Make Data Work

OAP: Using Spark SQL for ad hoc queries

This session will be presented in Chinese

Daoyuan Wang (Intel), 李元健 (Baidu)
13:10–13:50 Friday, 2017-07-14
Spark & beyond
Location: Grand Hall B (紫金大厅B)
Level: Intermediate
Average rating: 3.00 (1 rating)

Prerequisite Knowledge

Attendees should be familiar with Spark SQL or with data querying in general. Familiarity with Spark SQL's data source API is a plus.

What you'll learn

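The ideas behind using distributed indexes to optimize queries over large-scale distributed data, and the design thinking and implementation skills needed to extend Spark through its data source API.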

Description

Using Spark SQL for batch queries in a data warehouse is already common practice in industry. Spark SQL can process data efficiently from a rich set of data sources, but it still falls short for queries that must return within seconds, a need many enterprises have. Daoyuan Wang and 李元健 offer an overview of OAP, a project built on Spark SQL to serve ad hoc queries with second-level or even tighter latency requirements.

OAP provides a fine-grained, hierarchical caching mechanism that uses the Fiber as its basic unit and caches data in off-heap memory, which speeds up data loading. OAP also extends Spark SQL's DDL so that users can define their own indexes. It currently supports B+ tree indexes and Bloom filters, letting users choose an index that suits the characteristics of their data to further reduce I/O and improve query efficiency. At runtime OAP shares the same process as Spark SQL, so it introduces no additional maintenance cost.
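
To make the indexing idea concrete, here is a minimal Python sketch of how a per-file Bloom filter and a sorted key index can reduce the amount of data a query has to read. It is not OAP's implementation or on-disk format. The names `BloomFilter`, `SortedIndex`, `build_bloom_indexes`, and `lookup`, the in-memory "files", and the column names are all hypothetical, and the sorted list only stands in for the leaf level of a B+ tree.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class SortedIndex:
    """Sorted (key, row position) pairs: a flat stand-in for B+ tree leaves."""

    def __init__(self, rows, column):
        self.entries = sorted((row[column], pos) for pos, row in enumerate(rows))
        self.keys = [key for key, _ in self.entries]

    def range(self, low, high):
        """Row positions with low <= key <= high, found by binary search."""
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return [pos for _, pos in self.entries[start:end]]


def build_bloom_indexes(files, column):
    """Build one Bloom filter per file over the given column."""
    indexes = {}
    for name, rows in files.items():
        bloom = BloomFilter()
        for row in rows:
            bloom.add(row[column])
        indexes[name] = bloom
    return indexes


def lookup(files, indexes, column, value):
    """Equality lookup that scans only files whose filter may hold the value."""
    matches, scanned = [], []
    for name, rows in files.items():
        if not indexes[name].might_contain(value):
            continue  # the filter proves the value is absent, so skip the file
        scanned.append(name)
        matches.extend(row for row in rows if row[column] == value)
    return matches, scanned
```

In this sketch the Bloom filter suits equality predicates, where whole files can be skipped, while the sorted index suits range predicates, where only the matching row positions need to be fetched. A false positive from the filter costs an unnecessary scan but never a wrong result, because `lookup` still checks every row it reads.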

In 2016, the first version of the OAP platform, a collaboration between Intel and Baidu, was opened for internal use at Baidu, helping multiple core product teams move from inefficient batch-job queries to ad hoc queries. In Baidu's Phoenix Nest (凤巢) advertising system, data engineers analyze ad effectiveness over several terabytes of click and impression logs per day. There, OAP raised query performance to 5x that of native Spark SQL and, especially for complex queries and large-volume analysis, cut average latency from minutes to seconds, at the cost of only 3% additional index data.


Daoyuan Wang


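Daoyuan Wang (王道远) is a senior software engineer at Intel Asia-Pacific Research & Development Ltd. and an active contributor to the Apache Spark community, working on Spark SQL since 2014. Before that, Wang worked on the IDH distribution of Hive. Wang is also the Chinese translator of Learning Spark (《Spark快速大数据分析》).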




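李元健 is a senior R&D engineer in Baidu's Infrastructure Department and an Apache Spark contributor. Since joining Baidu in 2011, 李元健 has worked on and led the development of core services including Baidu's real-time computing platform DStream, the tracing platform Rig, the Spark platform, and the public-cloud service BigSQL.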


