Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
11:00am–11:40am Wednesday, March 27, 2019
Average rating: 4.67 (3 ratings)

Who is this presentation for?

  • Data scientists, distributed systems engineers, site reliability engineers, and directors of engineering



Prerequisite knowledge

  • A basic understanding of the public cloud
  • Familiarity with distributed systems, such as Mesos, Kafka, and Cassandra (useful but not required)

What you'll learn

  • Understand considerations for designing real-time applications that can autoscale for seasonal changes in load and achieve service-wide CPU utilizations of over 75%
  • Learn how Mist reliably maintains over 1 TB of application state amid high server-fault rates by checkpointing to AWS S3, and how it uses multilevel aggregation to solve Kafka topic imbalance issues
  • Discover how Mist identified key metrics that served as inputs for its autoscaling engine
  • Hear lessons learned from building a highly scalable and reliable real-time aggregation system


Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA consumes billions of messages a day from Kafka with a memory footprint of over 1 TB and aggregates over 100 million time series. Because it runs entirely on AWS Spot Instances, it is highly cost-effective, and its checkpointing mechanism keeps it reliable despite frequent instance terminations. LA can recover from hours-long EC2 outages by restoring the latest checkpoint from S3 and replaying Kafka messages from where it left off, ensuring no data loss. LA writes the aggregated data to the configured storage system (Cassandra, S3, or Kafka), performing over 1.5 billion writes to Cassandra per day while maintaining over 100 million concurrent state machines.
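The checkpoint-and-replay idea above can be sketched in a few lines. This is a minimal illustration, not LA's actual implementation: a dict stands in for S3, a list stands in for a Kafka partition, and all names (`checkpoint`, `recover`, `consume`) are hypothetical.

```python
import json

durable_store = {}  # stand-in for S3
log = [{"ts": i, "value": i * 10} for i in range(100)]  # stand-in for a Kafka partition

def checkpoint(state, offset):
    """Persist aggregation state together with the Kafka offset it reflects."""
    durable_store["checkpoint"] = json.dumps({"state": state, "offset": offset})

def recover():
    """Load the last checkpoint, or start fresh if none exists."""
    raw = durable_store.get("checkpoint")
    if raw is None:
        return {"count": 0, "sum": 0}, 0
    snap = json.loads(raw)
    return snap["state"], snap["offset"]

def consume(state, start, end):
    """Apply messages [start, end) to the running aggregate; return the new offset."""
    for msg in log[start:end]:
        state["count"] += 1
        state["sum"] += msg["value"]
    return end

# Process 60 messages, checkpoint, then "crash" and recover.
state, offset = recover()
offset = consume(state, offset, 60)
checkpoint(state, offset)

state, offset = recover()                   # after a simulated outage
offset = consume(state, offset, len(log))   # replay resumes at offset 60
assert state["count"] == 100                # no loss, no double counting
```

Because the state and the offset are saved atomically in one checkpoint object, recovery never double-counts a message and never skips one, which is the property the talk's "no data loss" claim rests on.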

The characteristic that sets LA apart is its ability to autoscale by intelligently learning about resource usage, covering both seasonal and long-term trends, and allocating resources accordingly. LA emits custom metrics that track resource usage for its different components, such as Kafka consumers, shared memory managers, and aggregators, to achieve server utilization of over 70%. LA is horizontally scalable to thousands of cores.
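The core of any utilization-targeting autoscaler can be reduced to a proportional sizing rule. The sketch below is a hypothetical simplification, not LA's actual algorithm (which also accounts for seasonal trends), and the function name and parameters are assumptions:

```python
import math

def desired_workers(current_workers, avg_cpu_util, target_util=0.70):
    """Return the worker count that would bring average CPU utilization
    near target_util, assuming load is constant and spreads evenly.
    Never scales below one worker."""
    return max(1, math.ceil(current_workers * avg_cpu_util / target_util))
```

For example, a 10-worker fleet running at 90% CPU grows to 13 workers, while the same fleet at 35% shrinks to 5, keeping utilization near the 70% target the talk describes.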

Mist also does multilevel aggregations in LA to intelligently solve load imbalance issues among different partitions of a Kafka topic. Osman and Chunky demonstrate multilevel aggregation using an example that aggregates indoor location data coming from different organizations both spatially and temporally. You’ll learn how changing partitioning keys and writing intermediate data back to Kafka in a new topic for the next-level aggregators help Mist scale its solution.
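The re-keying idea can be shown with a toy two-level pipeline. This is a hedged sketch under assumed message shapes (org, access point, x/y position), not the talk's actual example: level 1 keys on the finer (org, access_point) pair so a hot organization spreads across partitions, then re-keys by org and "publishes" the partial aggregates to a new topic (a plain list here) for level 2 to finalize.

```python
from collections import defaultdict

# Toy indoor-location messages: (org, access_point, x, y).
messages = [
    ("org-a", "ap-1", 0.0, 0.0),
    ("org-a", "ap-1", 2.0, 2.0),
    ("org-a", "ap-2", 4.0, 4.0),
    ("org-b", "ap-9", 10.0, 0.0),
]

def level1(msgs):
    """Partial aggregation keyed by (org, access_point)."""
    partial = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum_x, sum_y
    for org, ap, x, y in msgs:
        agg = partial[(org, ap)]
        agg[0] += 1
        agg[1] += x
        agg[2] += y
    # Re-key by org alone: the intermediate "topic" for the next level.
    return [(org, tuple(agg)) for (org, _ap), agg in partial.items()]

def level2(intermediate):
    """Final aggregation keyed by org: per-org position centroid."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for org, (n, sx, sy) in intermediate:
        agg = total[org]
        agg[0] += n
        agg[1] += sx
        agg[2] += sy
    return {org: (sx / n, sy / n) for org, (n, sx, sy) in total.items()}

centroids = level2(level1(messages))
```

Because the partial aggregates are small (one record per key rather than one per message), the second-level topic carries far less data than the first, which is what lets the final, coarser keying stay balanced.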

Photo of Osman Sarood

Osman Sarood

Mist Systems

Osman Sarood leads the infrastructure team at Mist Systems, where he helps scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations, with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.

Photo of Chunky Gupta

Chunky Gupta

Mist Systems

Chunky Gupta is a member of the technical staff at Mist Systems, where he works on scaling the company’s cloud infrastructure. Previously, he was a software engineer at Yelp, where he developed an autoscaling engine, FleetMiser, to intelligently autoscale Yelp’s Mesos cluster, saving millions of dollars. He also scaled Yelp’s in-house distributed and reliable task runner, Seagull (which he wrote about for the Yelp engineering blog). Before that, he built a Hadoop-based data warehouse system at Vizury. He gave a talk on FleetMiser at re:Invent 2016. Chunky holds an MS in computer science from Texas A&M University.