Fueling innovative software

July 15-18, 2019
Portland, OR

Add to Your Schedule

Optimizing analytical queries on Cassandra by 100x

Shradha Ambekar (Intuit)

5:05pm–5:45pm Wednesday, July 17, 2019

Building Data-Intensive Applications
Location: D138-140

Secondary topics: Data Driven

Average rating:

(4.50, 2 ratings)

Download slides (PDF)

Who is this presentation for?

Data engineers

Level

Intermediate

Description

Cassandra is one of the most popular datastores used in big data, real-time streaming, and machine learning applications. As the volume and velocity of data collected rapidly increases, it’s critical that the speed of data processing and analysis stays ahead in order to support today’s big data applications and meet end users’ service-level agreement (SLA) expectations, but the Cassandra storage model makes it difficult and sometimes very inefficient to run analytical queries.

Spark with Cassandra integration mitigates this problem. Spark is a very fast in-memory data processing framework used as an execution engine for analytics workloads. Intuit uses Spark and Cassandra in real-time and ML applications. The writes to Cassandra via Spark scaled very well, but reads (analytical queries) did not. Basic queries (counts) involving IN clause were taking several minutes and sometimes crashing.

Shradha Ambekar shares how her team debugged and solved performance problems at scale by concentrating on two concepts and architecture patterns: predicate push down and distributing workloads across partitions. Spark is very efficient in running analytical queries; however, if predicates are not pushed down to the datastore, it results in a full table scan and disastrous performance. With the Spark-Cassandra connector catalyst optimizer pushing predicates to Cassandra for the IN clause, queries were completed in a few seconds rather than several minutes (~30 minutes for a few TBs of data), resulting in a performance gain of 100×. But even if predicates are pushed down to datastore, performance may still suffer if the workload is not distributed across partitions. The modifications made to the Spark-Cassandra connector to create an optimal number of Spark partitions and distribute workload accordingly resulted in a 5x–10x faster query response in addition to the performance gain achieved by predicate pushdown.

Join in to discover how Shadha’s team at Intuit debugged performance problems, found the root cause, and implemented the fix—along with lessons learned and best practices for running analytic queries on Cassandra.

Prerequisite knowledge

General knowledge of Spark and Cassandra

What you'll learn

Learn Spark internals, techniques, and tools to debug performance issues when running Spark queries
See how similar debugging techniques and performance fixes are applicable to other Spark connectors or any other query execution engine

Shradha Ambekar

Intuit

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for lineage framework (SuperGLUE), real-time analytics, and has made several key contributions in building solutions around the data platform, and she contributed to spark-cassandra-connector. She has experience with Hadoop distributed file system (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Premier Diamond Sponsor

Diamond Sponsors

Platinum Sponsor

Gold Sponsors

Silver Sponsors

Supporting Sponsors

Premier Exhibitors

Exhibitors

Innovators

Non-Profit Exhibitors

Diversity and Inclusion Sponsors

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email oscon@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of OSCON contacts

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com