Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Apache Kylin 2.0: From classic OLAP to real-time data warehouse

Yang Li (Kyligence)
11:50am–12:30pm Thursday, March 16, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture
Average rating: 5.00 (3 ratings)

Who is this presentation for?

  • Big data engineers and analysts

Prerequisite knowledge

  • A basic understanding of Hadoop and Apache Kylin

What you'll learn

  • Get to know the latest features of Apache Kylin v2.0, including real-time OLAP to the latest second, snowflake schema support, richer SQL features, and subsecond query latency

Description

Apache Kylin, which started as a big data OLAP engine, is reaching v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

Yang explores the latest features of Apache Kylin v2.0 and introduces the technical thinking and designs behind them.

Apache Kylin used to support star schema only, which is quite a limitation for many real-world cases. Because v2.0 supports snowflake schema directly, users can import an arbitrary E-R model into Kylin. This gives Kylin the most comprehensive data model support out of the box, a big step forward for business deployments.
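To make the schema difference concrete: in a star schema every dimension table joins directly to the fact table, while in a snowflake schema a dimension may itself reference another dimension. The following is a minimal sketch of that idea in plain Python, not Kylin's implementation. The table names, columns, and the `flatten` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical tables; names and columns are made up for illustration.
sales = [  # fact table
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 20},
    {"customer_id": 1, "amount": 50},
]
customers = {  # dimension table, joined directly to the fact table
    1: {"name": "Ann", "city_id": 10},
    2: {"name": "Bob", "city_id": 20},
}
cities = {  # sub-dimension, reachable only through `customers` (the snowflake part)
    10: {"city": "San Jose", "country": "US"},
    20: {"city": "Shanghai", "country": "CN"},
}

def flatten(fact_rows):
    """Follow fact -> customer -> city and yield one wide row per fact row."""
    for row in fact_rows:
        customer = customers[row["customer_id"]]
        city = cities[customer["city_id"]]
        yield {**row, **customer, **city}

# Aggregate the measure by an attribute that lives two joins away from the fact table.
totals = {}
for row in flatten(sales):
    totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
```

With star schema only, `country` would have to be copied into the `customers` table by hand before modeling; with snowflake support the two-hop join can be declared as is.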

v2.0 introduces a new cubing engine based on Spark, a feature many users have long wanted. Implementing the same layered cubing algorithm, the Spark engine is about twice as fast as the old MapReduce engine in experiments.
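The general idea of layered cubing can be sketched as follows: build the base cuboid (all dimensions) from the raw records, then derive each lower layer by aggregating away one dimension from a cuboid in the layer above. This is a single-process Python illustration under simplifying assumptions (one additive measure, everything in memory), not the Kylin, Spark, or MapReduce code. The function names `aggregate` and `build_cube_by_layer` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def aggregate(rows, dims, keep):
    """Roll a cuboid keyed on `dims` up to the subset `keep`, summing the measure."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in rows.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def build_cube_by_layer(records, dims, measure):
    """Build every cuboid, one layer at a time, starting from the base cuboid."""
    base = defaultdict(int)
    for r in records:
        base[tuple(r[d] for d in dims)] += r[measure]
    cube = {tuple(dims): dict(base)}
    # Layer N-1 is computed from layer N, layer N-2 from layer N-1, and so on.
    for size in range(len(dims) - 1, -1, -1):
        for keep in combinations(dims, size):
            # Any cuboid in the layer above that contains all of `keep` can serve as parent.
            parent = next(p for p in cube
                          if len(p) == size + 1 and set(keep) <= set(p))
            cube[keep] = aggregate(cube[parent], list(parent), keep)
    return cube

records = [
    {"country": "US", "year": 2016, "sales": 10},
    {"country": "US", "year": 2017, "sales": 5},
    {"country": "CN", "year": 2017, "sales": 7},
]
cube = build_cube_by_layer(records, ["country", "year"], "sales")
```

Each layer reads only the already-aggregated layer above it rather than the raw records, which is why the same algorithm maps naturally onto a sequence of distributed jobs or stages.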

Since v1.6, Apache Kylin has supported microbatch data loading from Kafka, enabling near-real-time analysis with a latency of minutes. A demo will show how Twitter messages are analyzed in real time.

As always, Apache Kylin focuses on replacing online calculation with offline precalculation, which makes it quite different from other SQL-on-Hadoop solutions. With the ever-growing volume of data, precalculation (and Apache Kylin) may be the only way to ensure a constant query response time on big data.
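The trade-off described above can be illustrated with a small Python sketch, assuming a single additive measure and exact-match filters on every dimension. The function names are hypothetical and this is not how Kylin stores or queries cubes. The online query scans every record, so its cost grows with the data; the precalculated query is one dictionary lookup, whatever the number of raw records.

```python
from collections import defaultdict

def precalculate(records, dims, measure):
    """Offline step: aggregate the measure once for every combination of dimension values."""
    table = defaultdict(int)
    for r in records:
        table[tuple(r[d] for d in dims)] += r[measure]
    return dict(table)

def query_online(records, filters, measure):
    """Online calculation: scan all records at query time."""
    return sum(r[measure] for r in records
               if all(r[k] == v for k, v in filters.items()))

def query_precalculated(table, dims, filters):
    """Precalculated answer: a single lookup, independent of the record count."""
    return table.get(tuple(filters[d] for d in dims), 0)

dims = ["country", "year"]
records = [
    {"country": "US", "year": 2017, "sales": 5},
    {"country": "US", "year": 2017, "sales": 8},
    {"country": "CN", "year": 2017, "sales": 7},
]
table = precalculate(records, dims, "sales")
filters = {"country": "US", "year": 2017}
online = query_online(records, filters, "sales")
offline = query_precalculated(table, dims, filters)
```

Both paths return the same total; the difference is where the work happens. The cost of precalculation is paid once, offline, and in storage for the aggregated table.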

Yang Li

Kyligence

Yang Li is cofounder and CTO of Kyligence as well as a cocreator and PMC member of Apache Kylin. As the tech lead and architect of Kylin, Yang focuses on big data analysis, parallel computation, data indexing, relational algebra, approximation algorithms, and other technologies. Previously, he was senior architect of eBay’s Analytic Data Infrastructure department; tech lead of IBM’s InfoSphere BigInsights, where he was responsible for the Hadoop open source platform and won an Outstanding Technical Achievement award; and a vice president at Morgan Stanley, responsible for the global regulatory reporting platform.