Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

11:00am–11:40am Wednesday, March 27, 2019

Data Engineering & Architecture
Location: 2006

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Average rating: 4.67 (3 ratings)

Who is this presentation for?

Data scientists, distributed systems engineers, site reliability engineers, and directors of engineering

Level

Intermediate

Prerequisite knowledge

A basic understanding of the public cloud
Familiarity with distributed systems, such as Mesos, Kafka, and Cassandra (useful but not required)

What you'll learn

Understand considerations for designing real-time applications that can autoscale for seasonal changes in load and achieve service-wide CPU utilizations of over 75%
Learn how Mist reliably maintains over 1 TB application state amid high server faults by checkpointing in AWS S3 and uses multilevel aggregation to solve Kafka topic imbalance issues
Discover how Mist identified key metrics that served as inputs for its autoscaling engine
Hear lessons learned from building a highly scalable and reliable real-time aggregation system

Description

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that autoscales for sudden changes in load. LA consumes billions of messages a day from Kafka with a memory footprint of over 1 TB and aggregates over 100 million time series. It runs entirely on top of AWS Spot Instances, which AWS can reclaim on short notice, and is nonetheless highly reliable. LA can recover from hours-long EC2 outages using its checkpointing mechanism, which restores the latest checkpoint from S3 and replays messages from Kafka from where it left off, ensuring no data loss. LA writes the aggregated data to the configured storage system (Cassandra, S3, or Kafka). LA does over 1.5 billion writes to Cassandra per day and maintains over 100 million concurrent state machines.
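The checkpoint-and-replay recovery described above can be illustrated with a minimal sketch. This is not Mist's implementation: the `Aggregator` class, the list standing in for a Kafka partition, and the dict standing in for S3 are all hypothetical. The one idea it shows is that the aggregation state and the next offset to read are saved together, so a restarted worker can reload both and replay only the messages that arrived after the checkpoint.

```python
import copy


class Aggregator:
    """Toy aggregator (hypothetical): sums values per key and tracks its read offset."""

    def __init__(self):
        self.state = {}        # key -> running sum
        self.next_offset = 0   # offset of the next unread message in the log

    def consume(self, log, upto):
        """Apply messages log[next_offset:upto] to the in-memory state."""
        for key, value in log[self.next_offset:upto]:
            self.state[key] = self.state.get(key, 0) + value
        self.next_offset = max(self.next_offset, min(upto, len(log)))

    def checkpoint(self, store):
        """Save state and offset together so they always describe the same point."""
        store["checkpoint"] = copy.deepcopy(
            {"state": self.state, "next_offset": self.next_offset}
        )

    @classmethod
    def restore(cls, store):
        """Build a fresh aggregator from the last checkpoint (or from scratch)."""
        snapshot = copy.deepcopy(
            store.get("checkpoint", {"state": {}, "next_offset": 0})
        )
        agg = cls()
        agg.state = snapshot["state"]
        agg.next_offset = snapshot["next_offset"]
        return agg


# Stand-ins: a list plays the role of a Kafka partition, a dict plays S3.
log = [("ap1", 1), ("ap2", 2), ("ap1", 3), ("ap2", 4), ("ap1", 5)]
store = {}

worker = Aggregator()
worker.consume(log, 3)
worker.checkpoint(store)   # durable snapshot after three messages
worker.consume(log, 4)     # this progress is lost when the instance dies

recovered = Aggregator.restore(store)
recovered.consume(log, len(log))   # replay from the checkpointed offset

reference = Aggregator()
reference.consume(log, len(log))   # what an uninterrupted run would produce
```

Because the message consumed after the checkpoint is replayed rather than skipped, `recovered.state` matches `reference.state`. In a real deployment the replay window is bounded by how long Kafka retains messages, which is an operational detail this sketch does not model.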

The characteristic that sets LA apart is its ability to autoscale by intelligently learning about resource usage (both seasonal and long-term trends) and allocating resources accordingly. LA emits custom metrics that track resource usage for different components, such as Kafka consumers, shared memory managers, and aggregators, to achieve server utilization of over 70%. LA is horizontally scalable to thousands of cores.
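As a rough illustration of sizing capacity against a utilization target, the sketch below converts a usage forecast into a core count. The function name, its inputs, and the forecasting rule (take the larger of the recent average and the average for the same time slot on earlier days) are assumptions made for this example. The abstract does not describe how LA's autoscaling engine actually predicts load.

```python
import math


def desired_cores(recent_usage, seasonal_usage, target_utilization_pct=70):
    """Return the core count that keeps predicted usage at the target utilization.

    recent_usage:   cores actually used in the last few samples (hypothetical metric)
    seasonal_usage: cores used in the same time slot on previous days (hypothetical)
    """
    if not 0 < target_utilization_pct <= 100:
        raise ValueError("target_utilization_pct must be in (0, 100]")
    if not recent_usage or not seasonal_usage:
        raise ValueError("both usage histories must be non-empty")

    recent = sum(recent_usage) / len(recent_usage)
    seasonal = sum(seasonal_usage) / len(seasonal_usage)

    # Assumed forecasting rule: be conservative and plan for the larger signal.
    predicted = max(recent, seasonal)

    # Provision so that predicted usage equals the target share of capacity.
    return math.ceil(predicted * 100 / target_utilization_pct)


# Recent load averages 66 cores, but this time slot historically averages 84.
cores = desired_cores([60, 66, 72], [80, 84, 88])
```

Here `cores` is 120, since 84 busy cores out of 120 is 70% utilization. Raising the target means running fewer, busier servers, which leaves less headroom for sudden spikes.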

Mist also does multilevel aggregations in LA to intelligently solve load imbalance issues among different partitions for a Kafka topic. Osman and Chunky demonstrate multilevel aggregation using an example that aggregates indoor location data coming from different organizations both spatially and temporally. You’ll learn how changing partitioning keys and writing intermediate data back to Kafka in a new topic for the next level aggregators help Mist scale its solution.
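The two-level idea can be sketched as follows, under stated assumptions: the event shape `(org, zone, minute)`, the key formats, and the helper names are hypothetical, and plain dictionaries stand in for Kafka partitions and topics. Level 1 partitions by a fine-grained key so one very large organization is spread over many partitions. It then emits small partial aggregates that are re-keyed by organization for level 2.

```python
import zlib
from collections import Counter, defaultdict


def partition_for(key, num_partitions):
    """Stable hash partitioner (a stand-in for a Kafka partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def level1(events, num_partitions):
    """Partition raw events by org/zone and count them per (org, zone, minute)."""
    partitions = defaultdict(Counter)
    for org, zone, minute in events:
        p = partition_for(f"{org}/{zone}", num_partitions)
        partitions[p][(org, zone, minute)] += 1
    # Records written to the intermediate topic: (org, minute, partial count).
    return [
        (org, minute, count)
        for counts in partitions.values()
        for (org, zone, minute), count in counts.items()
    ]


def level2(intermediate, num_partitions):
    """Re-partition partial counts by org and sum them per (org, minute)."""
    partitions = defaultdict(Counter)
    for org, minute, count in intermediate:
        p = partition_for(org, num_partitions)
        partitions[p][(org, minute)] += count
    totals = Counter()
    for counts in partitions.values():
        totals.update(counts)
    return dict(totals)


# One large organization with ten zones and one small organization.
events = [("big", f"zone{i % 10}", 0) for i in range(1000)]
events += [("small", "zone0", 0)] * 5

intermediate = level1(events, num_partitions=4)
totals = level2(intermediate, num_partitions=4)
```

In this example the 1,005 raw events shrink to 11 intermediate records before they are keyed by organization. A single partition can then hold all of a large organization's level-2 data without becoming a hotspot.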

Osman Sarood

Mist Systems

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.

Chunky Gupta

Mist Systems

Chunky Gupta is a member of the technical staff at Mist Systems, where he works on scaling the company’s cloud infrastructure. Previously, he was a software engineer at Yelp, where he developed an autoscaling engine, FleetMiser, to intelligently autoscale Yelp’s Mesos cluster, saving millions of dollars. He also scaled Yelp’s in-house distributed and reliable task runner, Seagull (which he wrote about for the Yelp engineering blog). Before that, he built a Hadoop-based data warehouse system at Vizury. He gave a talk on FleetMiser at re:Invent 2016. Chunky holds an MS in computer science from Texas A&M University.

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com