Get the free Ebook:
Private and Open Data in Asia: A Regional Guide.
Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.
First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We’ll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format.
Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark’s Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations (see Spark’s Project Tungsten).
FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:
Evan Chan is a distinguished software engineer at Tuplejump. Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. Evan is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and cocreator and maintainer of the open source Spark Job Server. He is a big believer in GitHub, open source, and meetups and has given talks at various conferences, including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.