Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load, ensuring fault tolerance and scalability. LA consumes billions of messages a day from Kafka with a memory footprint of over 1 TB and aggregates over 100 million time series. Since it runs entirely on top of AWS Spot Instances, it’s highly reliable. LA can recover from hours-long EC2 outages using its checkpointing mechanism, which recovers the checkpoint from S3 and replays messages from Kafka where it left off, ensuring no data loss. LA writes the aggregated data to the configured storage system (either be Cassandra, S3 or Kafka). LA does over 1.5 billion writes to Cassandra per day and maintains over 100 million concurrent state machines.
The characteristic that sets LA apart is its ability to autoscale by intelligently learning about resource usage (both seasonal and long-term trends) and allocating resources accordingly. LA emits custom metrics that track resource usage for different components, such as Kafka consumers and shared memory managers and aggregators, to achieve server utilization of over 70%. LA is horizontally scalable to thousands of cores.
Mist also does multilevel aggregations in LA to intelligently solve load imbalance issues among different partitions for a Kafka topic. Osman and Chunky demonstrate multilevel aggregation using an example that aggregates indoor location data coming from different organizations both spatially and temporally. You’ll learn how changing partitioning keys and writing intermediate data back to Kafka in a new topic for the next level aggregators help Mist scale its solution.
Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.
Chunky Gupta is a member of the technical staff at Mist Systems, where he works on scaling the company’s cloud infrastructure. Previously, he was a software engineer at Yelp, where he developed an autoscaling engine, FleetMiser, to intelligently autoscale Yelp’s Mesos cluster, saving millions of dollars. He also scaled Yelp’s in-house distributed and reliable task runner, Seagull (which he wrote about for the Yelp engineering blog). Before that, he built a Hadoop-based data warehouse system at Vizury. He gave a talk on FleetMiser at re:Invent 2016. Chunky holds an MS in computer science from Texas A&M University.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com