Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

How embracing unreliability can make infrastructure reliable and cost-effective

Osman Sarood (Mist Systems)
1:25pm–2:05pm Thursday, June 13, 2019
Building Cloud Native Systems
Location: LL21 A/B
Average rating: 4.57 (7 ratings)



Prerequisite knowledge

  • A basic understanding of public cloud
  • Familiarity with distributed systems like Mesos and Storm (useful but not required)

What you'll learn

  • Learn how Mist kept its annual cost at $2 million rather than $4 million (a 50% reduction) using AWS Spot Instances while keeping its infrastructure reliable


Server faults are a reality. While public cloud vendors work to improve hardware- and VM-level reliability, software should play its part by being resilient to those failures. Mist Systems consumes several terabytes of telemetry data every day from its wireless access points (APs) deployed all over the world. A significant portion of this telemetry data is consumed by machine learning algorithms that are essential to the smooth operation of some of the world's largest WiFi deployments. Mist applies machine learning to incoming telemetry to detect and attribute anomalies, a nontrivial problem that requires exploring multiple dimensions. Although its infrastructure is small compared to that of some of the tech giants, it's growing very rapidly: last year, the company saw 10x growth in its infrastructure, taking its annual AWS cost over $2 million.

Most of the company's anomaly detection and attribution is done in real time. Doing this effectively can require significant resources and can quickly become cost prohibitive. Mist's data pipeline starts with Kafka, where all incoming telemetry data is buffered. The company has two main real-time processing engines: Apache Storm and Live-aggregators, an in-house real-time time series aggregation framework. The Storm topologies host the bulk of Mist's machine learning algorithms. They consume telemetry data from Kafka, apply domain-specific models to estimate metrics like throughput, capacity, and coverage for each WiFi client, and write these numbers back to Kafka. Live-aggregators reads the estimated metrics and aggregates them using different groupings (e.g., the 10-minute average throughput per organization). After aggregating the data, Live-aggregators writes it to Cassandra. Other topologies consume the aggregated data to detect and attribute anomalies. Mist's API can then query Cassandra and serve these aggregates or anomalies to the end user.
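The windowed-aggregation step that Live-aggregators performs can be sketched in a few lines. This is a minimal illustration, not Mist's implementation: the event field names (org, ts, throughput) and the fixed 10-minute tumbling window are assumptions.

```python
from collections import defaultdict

def aggregate_throughput(events, window_secs=600):
    """Average estimated throughput per (organization, 10-minute window).

    `events` are dicts with hypothetical keys 'org', 'ts' (epoch seconds),
    and 'throughput' (Mbps); the schema is assumed for illustration.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (org, window start) -> [total, count]
    for e in events:
        window = e["ts"] - (e["ts"] % window_secs)  # bucket start time
        key = (e["org"], window)
        sums[key][0] += e["throughput"]
        sums[key][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

events = [
    {"org": "acme", "ts": 1000, "throughput": 40.0},
    {"org": "acme", "ts": 1100, "throughput": 60.0},
    {"org": "acme", "ts": 1700, "throughput": 10.0},  # falls in the next window
]
print(aggregate_throughput(events))
# {('acme', 600): 50.0, ('acme', 1200): 10.0}
```

In the real pipeline these aggregates would be computed continuously over a Kafka stream and written to Cassandra rather than returned in memory.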

Osman Sarood offers an overview of Mist's infrastructure and explains how the company kept its annual cost at $2 million rather than $4 million (a 50% reduction) using AWS Spot Instances while keeping its infrastructure reliable. Spot Instances are, on average, 85% cheaper than traditional on-demand instances but can terminate at any time with only a two-minute warning. Handling such volatility is generally difficult for real-time applications, especially machine learning applications. Join in to learn how Mist architected its services and real-time topologies to be resilient to server faults, and explore its monitoring and alerting strategy and the critical role it plays in ensuring reliability. You'll see how Mist tracks each spot market's (an instance type and availability zone combination) price and number of instance terminations to determine which spot markets are safer to use, as well as the impact of losing Spot Instances on real-time platforms like Storm versus microservices running on top of Mesos.
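The market-selection idea above can be sketched as a simple ranking over observed per-market stats. The scoring rule here (prefer fewer recent terminations, break ties on price) is an illustrative assumption, not Mist's actual policy, and the stats would come from price feeds and termination logs in practice.

```python
def rank_spot_markets(markets):
    """Rank spot markets (instance type + availability zone) by estimated safety.

    `markets` maps an (instance_type, az) pair to observed stats:
    {'price': current spot price in $/hr, 'terminations': terminations
    seen over some lookback window}. Both the stats and the scoring
    rule are assumptions for illustration.
    """
    return sorted(
        markets,
        key=lambda m: (markets[m]["terminations"], markets[m]["price"]),
    )

markets = {
    ("c5.xlarge", "us-east-1a"): {"price": 0.068, "terminations": 9},
    ("c5.xlarge", "us-east-1b"): {"price": 0.071, "terminations": 1},
    ("m5.xlarge", "us-east-1a"): {"price": 0.070, "terminations": 1},
}
print(rank_spot_markets(markets)[0])
# ('m5.xlarge', 'us-east-1a') -- fewest terminations, cheapest among the tie
```

A real system would also spread capacity across several of the top-ranked markets rather than concentrating in one, so a single market's termination wave cannot take out the whole fleet.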

Seeing is believing. Osman concludes with a demo of terminating Spot Instances in Mist's production Storm and Mesos clusters (which run entirely on Spot Instances) and illustrates their impact by examining real-time health metrics. He also explains how many Spot Instance terminations the company can endure in each of its Storm and Mesos clusters and the overprovisioning required to ensure it always has enough capacity for high availability.
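The overprovisioning arithmetic is straightforward to sketch. The fixed per-instance capacity and the capacity units below are simplifying assumptions for illustration; the actual sizing would depend on each cluster's workload.

```python
import math

def instances_needed(required_capacity, per_instance_capacity, tolerated_terminations):
    """Minimum fleet size such that losing `tolerated_terminations` Spot
    Instances at once still leaves enough capacity for the workload.

    Assumes (for illustration) that every instance contributes the same
    fixed capacity.
    """
    base = math.ceil(required_capacity / per_instance_capacity)
    return base + tolerated_terminations

# e.g., a cluster needing 40 capacity units at 4 units per instance,
# sized to survive 3 simultaneous spot terminations:
print(instances_needed(40, 4, 3))  # 13 instances, i.e., 30% overprovisioned
```

The two-minute interruption warning then buys time to drain work from the doomed instances while the overprovisioned headroom absorbs the load.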


Osman Sarood

Mist Systems

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.