Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Yang explores the latest features of Apache Kylin v2.0 and introduces the technical thinking and designs behind them.
Apache Kylin used to support star schema only, which is quite a limitation in many real-world cases. v2.0 supports snowflake schema directly, so users can import an arbitrary E-R model into Kylin. Supporting the most comprehensive data model out of the box is a big step forward for business deployments.
v2.0 introduces a new cubing engine based on Spark, a feature many users have long wanted. Implementing the same layered cubing algorithm, the Spark engine is about two times faster than the old MapReduce engine, as experiments show.
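To make the layered cubing idea concrete, here is a minimal single-process sketch in Python. It is not Kylin's MapReduce or Spark code; it only assumes that "layered" means building the base cuboid over all N dimensions first and then deriving each (N-1)-dimension cuboid from a cuboid in the previous layer, and so on down to the grand total. The function names, the sample records, and the `sales` measure are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(parent_rows, parent_dims, child_dims):
    """Roll a parent cuboid up to child_dims by summing the measure."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, value in parent_rows.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def build_cube_by_layer(records, dims, measure):
    """Build every cuboid, one layer at a time (illustrative only)."""
    dims = tuple(dims)
    # Layer 0: the base cuboid groups the raw records by all dimensions.
    base = defaultdict(int)
    for r in records:
        base[tuple(r[d] for d in dims)] += r[measure]
    cube = {dims: dict(base)}

    layer = [dims]
    while layer and len(layer[0]) > 0:
        next_layer = []
        for parent in layer:
            # Each child drops exactly one dimension from its parent.
            for child in combinations(parent, len(parent) - 1):
                if child not in cube:
                    cube[child] = aggregate(cube[parent], parent, child)
                    next_layer.append(child)
        layer = next_layer
    return cube


# Hypothetical sample data, not taken from the talk.
records = [
    {"country": "US", "category": "A", "sales": 10},
    {"country": "US", "category": "B", "sales": 5},
    {"country": "CN", "category": "A", "sales": 7},
]
cube = build_cube_by_layer(records, ["country", "category"], "sales")
print(cube[("country",)])
print(cube[()])
```

Because every layer reads only the layer above it, the same structure maps onto one distributed job per layer, whichever execution engine runs it.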
Since v1.6, Apache Kylin has supported microbatch data loading from Kafka, enabling near-real-time analysis with latency measured in minutes. A demo will show how Twitter messages are analyzed in real time.
And as always, Apache Kylin focuses on replacing online calculation with offline precalculation, which sets it apart from other SQL-on-Hadoop solutions. As data volumes keep growing, precalculation (and Apache Kylin) may be the only way to keep query response time constant on big data.
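The trade-off between online calculation and offline precalculation can be illustrated with a toy Python comparison. This is a sketch under stated assumptions, not Kylin's query path: `scan_query` and `precompute` are hypothetical names, and the in-memory dictionary stands in for a precomputed aggregate stored ahead of time.

```python
from collections import defaultdict


def scan_query(records, country):
    """Online calculation: work grows with the number of raw records."""
    return sum(r["sales"] for r in records if r["country"] == country)


def precompute(records):
    """Offline precalculation: aggregate once, before any query arrives."""
    totals = defaultdict(int)
    for r in records:
        totals[r["country"]] += r["sales"]
    return dict(totals)


# Hypothetical sample data, not taken from the talk.
records = [
    {"country": "US", "sales": 10},
    {"country": "US", "sales": 5},
    {"country": "CN", "sales": 7},
]

sales_by_country = precompute(records)  # paid once, offline

# Query time: a single lookup, independent of how many raw rows exist.
us_total = sales_by_country["US"]
print(us_total, scan_query(records, "US"))
```

Both paths return the same answer. The difference is where the cost lands: the scan repeats the full pass for every query, while the precomputed lookup stays the same size as the raw data grows.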
Yang Li is cofounder and CTO of Kyligence as well as a cocreator and PMC member of Apache Kylin. As the tech lead and architect of Kylin, Yang focuses on big data analysis, parallel computation, data indexing, relational algebra, approximation algorithms, and other technologies. Previously, he was senior architect of eBay’s Analytic Data Infrastructure department; tech lead of IBM’s InfoSphere BigInsights, where he was responsible for the Hadoop open source platform and won an Outstanding Technical Achievement award; and a vice president at Morgan Stanley, responsible for the global regulatory reporting platform.