Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Real-time analytics using Kudu at petabyte scale

Sridhar Alla (BlueWhale), Shekhar Agrawal (Comcast)
4:20pm–5:00pm Wednesday, March 15, 2017
Real-time applications, Stream processing and analytics
Location: LL20 A
Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Average rating: 5.00 (2 ratings)

Who is this presentation for?

  • Engineers, architects, and developers

Prerequisite knowledge

  • Familiarity with the Hadoop ecosystem
  • A basic knowledge of MapReduce and HDFS

What you'll learn

  • The practical aspects of the Kudu storage system and how to use Spark with Kudu to provide fast analytics on huge datasets


Kudu is redefining the big data ecosystem and opening doors to capabilities not previously available. Sridhar Alla and Shekhar Agrawal explain how Comcast has deployed the largest Kudu cluster thus far and is rapidly developing advanced applications that provide real-time analytics at petabyte scale while avoiding expensive denormalization processes. They also cover how real-time analytics using Kudu scales much higher than with other NoSQL databases.

Sridhar and Shekhar share the practical implementation details and discuss extensive benchmarks at table sizes of 1 trillion events. While the Spark platform processes both the historical data and the real-time events streaming through Kafka, the middle tier accesses Kudu tables to generate subsecond real-time dashboards while still having the power of Hadoop to deliver batch analytics and integrations with other platforms. This is key to the success of the platform: previously, Comcast had to rely on a variety of multitiered architectures to provide fast storage while still supporting updates the way NoSQL engines do, but without the lag caused by several thousand updates per second.

Sridhar Alla


Sridhar Alla is cofounder and CTO at BlueWhale, which brings together the worlds of big data and artificial intelligence to provide comprehensive solutions to meet the business needs of organizations of all sizes. He and his team are cloud and tool agnostic and strive to embed themselves into the workstream to provide strategic and technical assistance, with solutions such as predictive modeling and analytics, capacity planning, forecasting, anomaly detection, advanced NLP, chatbot development, SAS to Python migration, and deep learning-based model building and operationalization. Sridhar is also the author of three books and an avid presenter at conferences including Strata, Hadoop World, Spark Summit and others.

Shekhar Agrawal


Shekhar Agrawal is the director of data science at Comcast. Shekhar is an expert data scientist specializing in text and NLP. He currently handles several PB-scale modeling initiatives to improve customer experience factors.
