Fueling innovative software
July 15-18, 2019
Portland, OR

Optimizing analytical queries on Cassandra by 100x

Shradha Ambekar (Intuit)
5:05pm5:45pm Wednesday, July 17, 2019
Secondary topics:  Data Driven
Average rating: ****.
(4.50, 2 ratings)

Who is this presentation for?

  • Data engineers

Level

Intermediate

Description

Cassandra is one of the most popular datastores used in big data, real-time streaming, and machine learning applications. As the volume and velocity of data collected rapidly increases, it’s critical that the speed of data processing and analysis stays ahead in order to support today’s big data applications and meet end users’ service-level agreement (SLA) expectations, but the Cassandra storage model makes it difficult and sometimes very inefficient to run analytical queries.

Spark with Cassandra integration mitigates this problem. Spark is a very fast in-memory data processing framework used as an execution engine for analytics workloads. Intuit uses Spark and Cassandra in real-time and ML applications. The writes to Cassandra via Spark scaled very well, but reads (analytical queries) did not. Basic queries (counts) involving IN clause were taking several minutes and sometimes crashing.

Shradha Ambekar shares how her team debugged and solved performance problems at scale by concentrating on two concepts and architecture patterns: predicate push down and distributing workloads across partitions. Spark is very efficient in running analytical queries; however, if predicates are not pushed down to the datastore, it results in a full table scan and disastrous performance. With the Spark-Cassandra connector catalyst optimizer pushing predicates to Cassandra for the IN clause, queries were completed in a few seconds rather than several minutes (~30 minutes for a few TBs of data), resulting in a performance gain of 100×. But even if predicates are pushed down to datastore, performance may still suffer if the workload is not distributed across partitions. The modifications made to the Spark-Cassandra connector to create an optimal number of Spark partitions and distribute workload accordingly resulted in a 5x–10x faster query response in addition to the performance gain achieved by predicate pushdown.

Join in to discover how Shadha’s team at Intuit debugged performance problems, found the root cause, and implemented the fix—along with lessons learned and best practices for running analytic queries on Cassandra.

Prerequisite knowledge

  • General knowledge of Spark and Cassandra

What you'll learn

  • Learn Spark internals, techniques, and tools to debug performance issues when running Spark queries
  • See how similar debugging techniques and performance fixes are applicable to other Spark connectors or any other query execution engine
Photo of Shradha Ambekar

Shradha Ambekar

Intuit

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for lineage framework (SuperGLUE), real-time analytics, and has made several key contributions in building solutions around the data platform, and she contributed to spark-cassandra-connector. She has experience with Hadoop distributed file system (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.