Cassandra is one of the most popular datastores used in big data, real-time streaming, and machine learning applications. As the volume and velocity of data collected rapidly increases, it’s critical that the speed of data processing and analysis stays ahead in order to support today’s big data applications and meet end users’ service-level agreement (SLA) expectations, but the Cassandra storage model makes it difficult and sometimes very inefficient to run analytical queries.
Spark with Cassandra integration mitigates this problem. Spark is a very fast in-memory data processing framework used as an execution engine for analytics workloads. Intuit uses Spark and Cassandra in real-time and ML applications. The writes to Cassandra via Spark scaled very well, but reads (analytical queries) did not. Basic queries (counts) involving IN clause were taking several minutes and sometimes crashing.
Shradha Ambekar shares how her team debugged and solved performance problems at scale by concentrating on two concepts and architecture patterns: predicate push down and distributing workloads across partitions. Spark is very efficient in running analytical queries; however, if predicates are not pushed down to the datastore, it results in a full table scan and disastrous performance. With the Spark-Cassandra connector catalyst optimizer pushing predicates to Cassandra for the IN clause, queries were completed in a few seconds rather than several minutes (~30 minutes for a few TBs of data), resulting in a performance gain of 100×. But even if predicates are pushed down to datastore, performance may still suffer if the workload is not distributed across partitions. The modifications made to the Spark-Cassandra connector to create an optimal number of Spark partitions and distribute workload accordingly resulted in a 5x–10x faster query response in addition to the performance gain achieved by predicate pushdown.
Join in to discover how Shadha’s team at Intuit debugged performance problems, found the root cause, and implemented the fix—along with lessons learned and best practices for running analytic queries on Cassandra.
Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for lineage framework (SuperGLUE), real-time analytics, and has made several key contributions in building solutions around the data platform, and she contributed to spark-cassandra-connector. She has experience with Hadoop distributed file system (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org