Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Breakthrough OLAP performance on Cassandra and Spark

Evan Chan (Tuplejump)
1:30pm–2:10pm Thursday, 12/03/2015
Hadoop & Beyond
Location: 328-329 Level: Intermediate

Prerequisite Knowledge

Working familiarity with Apache Cassandra and Apache Spark, and with analytical/BI architectures and databases in general. This is primarily an architectural talk, so API knowledge is not needed.

Description

Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10x to 1000x improvement in OLAP performance. We will then introduce a new open-source project that combines these performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.

First, the existing Cassandra Spark connector makes it easy to load data from Cassandra into Spark. We’ll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format.
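
As a rough illustration of why a columnar layout with compact encoding helps analytical scans, the sketch below pivots a few rows into columns and dictionary-encodes a low-cardinality column. It is a minimal pure-Python example, not the talk's or any project's actual storage format; the sample rows and the helper names `to_columns` and `dictionary_encode` are hypothetical.

```python
# Hypothetical row-oriented events, shaped like rows read from a table.
rows = [
    {"city": "Singapore", "sensor": "temp", "value": 31.0},
    {"city": "Singapore", "sensor": "temp", "value": 30.5},
    {"city": "Jakarta", "sensor": "temp", "value": 33.0},
    {"city": "Singapore", "sensor": "temp", "value": 29.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup list."""
    lookup = {}
    codes = []
    for v in values:
        # Assign the next free code the first time a value is seen.
        codes.append(lookup.setdefault(v, len(lookup)))
    # Order the distinct values by their assigned code.
    dictionary = sorted(lookup, key=lookup.get)
    return dictionary, codes


columns = to_columns(rows)
city_dict, city_codes = dictionary_encode(columns["city"])

# An aggregate over one column reads only that column's values.
average_value = sum(columns["value"]) / len(columns["value"])
```

Here an average over `value` never touches the `city` or `sensor` data, and the repeated city strings shrink to small integers, which is the general idea behind reading less data per analytical query.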

Next, we explore how to integrate this new storage with Spark SQL and its pluggable Data Sources API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark’s Catalyst execution planner, and vectorization for fast cache- and GPU-friendly calculations (see Spark’s Project Tungsten).

FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:

  • Easy exactly-once ingestion from Kafka for streaming and IoT applications
  • How FiloDB + the Spark – Kafka – Cassandra stack can power smart cities and big event stream processing
  • Incremental computed columns and geospatial annotations. We’ll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.
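
One common way to get exactly-once results from an at-least-once stream is to make writes idempotent, so that a redelivered message overwrites its earlier copy instead of duplicating it. The sketch below illustrates that general idea with an in-memory stand-in for a table keyed by (partition, offset). It is an assumption for illustration only, not a description of how FiloDB implements ingestion; `IdempotentStore` and `ingest` are hypothetical names.

```python
class IdempotentStore:
    """In-memory stand-in for a table whose primary key is (partition, offset)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, partition, offset, payload):
        # Writing the same key twice leaves a single row behind.
        self.rows[(partition, offset)] = payload


def ingest(store, messages):
    """Write each (partition, offset, payload) message as an upsert."""
    for partition, offset, payload in messages:
        store.upsert(partition, offset, payload)


store = IdempotentStore()
batch = [(0, 0, "a"), (0, 1, "b"), (1, 0, "c")]

ingest(store, batch)
# Simulate a redelivery of part of the batch after a failure.
ingest(store, batch[1:])

total_rows = len(store.rows)
```

After the simulated redelivery the store still holds one row per message, so downstream aggregates are not inflated by replays.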

Evan Chan

Tuplejump

Evan Chan is a distinguished software engineer at Tuplejump. Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. Evan is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and cocreator and maintainer of the open source Spark Job Server. He is a big believer in GitHub, open source, and meetups and has given talks at various conferences, including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.