Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

How to cost-effectively and reliably build infrastructure for machine learning

Osman Sarood (Mist Systems)
4:35pm–5:15pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 21/22 Level: Beginner
Secondary topics:  Data Platforms

Who is this presentation for?

  • Data scientists, distributed systems engineers, site reliability engineers, and directors of engineering

Prerequisite knowledge

  • A basic understanding of the public cloud
  • Familiarity with distributed systems like Mesos and Storm (useful but not required)

What you'll learn

  • Learn how Mist Systems does machine learning using AWS spot instances, saving more than $2 million a year in compute cost compared to on-demand EC2 instances
  • Understand how to select the right EC2 instance types, how much overprovisioning is needed to ensure reliability, and how losing spot instances affects different types of applications


Mist Systems consumes several terabytes of telemetry data every day from its wireless access points (APs) deployed all over the world. A significant portion of this telemetry data is consumed by machine learning algorithms, which are essential for the smooth operation of some of the world’s largest WiFi deployments. Mist applies machine learning to incoming telemetry data to detect and attribute anomalies, which is a nontrivial problem and requires exploring multiple dimensions. Although the infrastructure is small compared to some of the tech giants, it’s growing very rapidly.

Most of Mist’s anomaly detection and attribution is done in real time. Doing this effectively can require significant resources and can quickly become cost prohibitive. Mist’s data pipeline starts with Kafka, where all incoming telemetry data is buffered. The company has two main real-time processing engines: Apache Storm and Live-aggregators, an in-house real-time time series aggregation framework. Mist’s Storm topologies host the bulk of its machine learning algorithms, which consume telemetry data from Kafka, apply domain-specific models to it to estimate metrics like throughput, capacity, and coverage for each WiFi client, and write these estimates back to Kafka. Live-aggregators reads the estimated metrics and aggregates them using different groupings (e.g., 10-minute average throughput per organization). After aggregating the data, Live-aggregators writes it to Cassandra. Other Storm topologies consume the aggregated data to detect and attribute anomalies. The API can then query Cassandra and serve these aggregates or anomalies to the end user.
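The grouping mentioned above (10-minute average throughput per organization) can be illustrated with a small in-memory sketch. This is not Live-aggregators itself, whose design is not described here; the `Metric` record, its field names, and the `aggregate_throughput` function are hypothetical, and a real pipeline would read from Kafka and write to Cassandra rather than use Python lists and dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple

WINDOW_SECONDS = 600  # 10-minute windows, matching the example grouping


class Metric(NamedTuple):
    """Hypothetical record for one estimated metric sample."""
    org_id: str             # organization the WiFi client belongs to
    timestamp: int          # Unix time in seconds
    throughput_mbps: float  # estimated throughput for one client


def aggregate_throughput(
    metrics: Iterable[Metric],
) -> Dict[Tuple[str, int], float]:
    """Average throughput keyed by (org_id, window_start)."""
    sums: Dict[Tuple[str, int], float] = defaultdict(float)
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for m in metrics:
        # Align each sample to the start of its 10-minute window.
        window_start = m.timestamp - (m.timestamp % WINDOW_SECONDS)
        key = (m.org_id, window_start)
        sums[key] += m.throughput_mbps
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    samples = [
        Metric("org-a", 0, 10.0),
        Metric("org-a", 599, 30.0),
        Metric("org-a", 600, 50.0),
        Metric("org-b", 10, 4.0),
    ]
    for key, avg in sorted(aggregate_throughput(samples).items()):
        print(key, avg)
```

Keeping running sums and counts per key, rather than storing every sample, is one common way to keep memory bounded in a streaming aggregator; whether Mist does this is not stated in the abstract.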

Osman Sarood explains how Mist reliably runs 75% of its production infrastructure on AWS EC2 spot instances, which has brought its annual AWS cost down from $3 million to $1 million, a reduction of roughly 66%. Spot instances are on average 80% cheaper than traditional on-demand instances but can be terminated at any time with a two-minute warning. Handling such volatility is generally difficult for real-time applications, especially machine learning applications. Osman also covers the monitoring and alerting strategy for Mist’s applications and explains why it is a critical part of ensuring reliability. He also shares his experience using Amazon’s spot fleet and explains how Mist identified which EC2 instance types (memory intensive versus compute intensive) to use, given that instance types have different spot price profiles and that being outbid can compromise cluster stability. You’ll also discover the impact of losing spot instances for real-time platforms like Storm versus microservices running on top of Mesos.
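The cost figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the abstract ($3 million, $1 million, a 75% spot share, and an 80% average spot discount). The function names are hypothetical, and the blended-cost model assumes the spot share is measured in on-demand dollars and that the average discount applies uniformly, neither of which the abstract states, so its result should not be expected to match the reported savings exactly.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


def blended_cost_fraction(spot_share: float, spot_discount: float) -> float:
    """Cost relative to an all on-demand fleet.

    spot_share:    fraction of on-demand-equivalent spend moved to spot
    spot_discount: fractional discount of spot versus on-demand
    """
    on_demand_share = 1.0 - spot_share
    return on_demand_share + spot_share * (1.0 - spot_discount)


if __name__ == "__main__":
    # Reported annual cost before and after moving to spot instances.
    print(f"reported reduction: {percent_reduction(3_000_000, 1_000_000):.1f}%")

    # Simplified model using the 75% share and 80% average discount.
    fraction = blended_cost_fraction(spot_share=0.75, spot_discount=0.80)
    print(f"modelled cost vs. all on-demand: {fraction:.0%}")
```

Under these simplifying assumptions the model gives a cost of about 40% of the all on-demand baseline, while the reported figures correspond to a reduction of roughly two-thirds. The gap is a reminder that actual discounts vary by instance type and over time.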

Seeing is believing: Osman concludes with a demo of terminating spot instances from Mist’s production Storm and Mesos clusters, which run entirely on spot instances, and illustrates the impact of those terminations by examining real-time health metrics. He also details how many spot instance terminations Mist can endure for each of its Storm and Mesos clusters and the associated overprovisioning required to ensure the company always has enough capacity for high availability.
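The relationship between tolerated terminations and overprovisioning can be sketched with a simple capacity model. This is an illustration only, not Mist's actual sizing method, which the abstract does not describe: it assumes identical instances, counts capacity in whole instances, and ignores the time a cluster needs to reschedule work after a loss. All function names are hypothetical.

```python
def instances_to_provision(required: int, tolerated_losses: int) -> int:
    """Fleet size that still meets demand after `tolerated_losses` terminations."""
    if required <= 0 or tolerated_losses < 0:
        raise ValueError("required must be positive and losses non-negative")
    return required + tolerated_losses


def overprovisioning_percent(required: int, tolerated_losses: int) -> float:
    """Extra capacity, as a percentage of what the workload strictly needs."""
    if required <= 0 or tolerated_losses < 0:
        raise ValueError("required must be positive and losses non-negative")
    return 100.0 * tolerated_losses / required


def max_tolerable_losses(provisioned: int, required: int) -> int:
    """How many instances can be lost before capacity drops below demand."""
    return max(0, provisioned - required)


if __name__ == "__main__":
    # Hypothetical cluster: 20 instances needed, survive 5 simultaneous losses.
    print(instances_to_provision(20, 5))
    print(overprovisioning_percent(20, 5))
    print(max_tolerable_losses(25, 20))
```

In this model, surviving more simultaneous terminations costs proportionally more idle capacity, which is the trade-off between reliability and cost that the talk addresses.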


Osman Sarood

Mist Systems

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.