Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Masquerading malicious DNS traffic

David Rodriguez (Cisco Systems)
11:50am-12:30pm Thursday, March 28, 2019

Who is this presentation for?

  • Machine learning engineers, data scientists, and statisticians



Prerequisite knowledge

  • A basic understanding of Apache Spark, statistics, and probability

What you'll learn

  • Learn how Cisco uses Apache Spark and Stripe’s Bayesian inference software, Rainier, to fit the underlying time series distribution for millions of domains, and explore techniques for identifying artificial traffic volumes related to spam, malvertising, and botnets (masquerading traffic)


Masquerading traffic is artificially generated traffic mixed in with normal traffic. Detecting this change in behavior is often difficult because network traffic is inherently random, which causes most unsupervised and supervised statistical models to fail.

David Rodriguez explains how Cisco performs large-scale Bayesian inference on DNS logs to uncover masquerading traffic in count data: the number of requests that tens of millions of stub IPs make to hundreds of millions of domains. Using novel mixtures of common discrete distributions, or hidden Markov processes, the company models some of the most sporadic traffic volumes to domain names. With zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) distributions and their more generalized forms, it treats the gaps in requests as just as important as the requests themselves, teasing out underlying changes in request patterns.
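To make the role of the gaps concrete, here is a minimal sketch of a ZIP log-likelihood in plain Python. It is an illustration only, not Cisco's implementation: the function names, the hourly counts, and the parameter values (`pi=0.75`, `lam=8.0`) are all hypothetical. It compares a plain Poisson fit with a ZIP fit on a series that is mostly zeros with a few bursts.

```python
import math


def zip_logpmf(k, pi, lam):
    """Log-probability of count k under a zero-inflated Poisson.

    pi  : probability of a structural zero (an idle interval), 0 <= pi < 1
    lam : Poisson rate while the requester is active, lam > 0
    """
    if k < 0:
        return float("-inf")
    if k == 0:
        # A zero can come from the idle component or from the Poisson itself.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) - lam + k * math.log(lam) - math.lgamma(k + 1)


def zip_loglik(counts, pi, lam):
    """Total ZIP log-likelihood of a sequence of counts."""
    return sum(zip_logpmf(k, pi, lam) for k in counts)


def poisson_loglik(counts, lam):
    """Total log-likelihood under an ordinary Poisson with rate lam."""
    return sum(-lam + k * math.log(lam) - math.lgamma(k + 1) for k in counts)


# Hypothetical hourly request counts for one domain: long gaps, short bursts.
counts = [0, 0, 0, 7, 0, 0, 9, 0, 0, 0, 8, 0]

# The plain Poisson is fitted at its maximum-likelihood rate (the sample mean).
plain = poisson_loglik(counts, sum(counts) / len(counts))
# The ZIP parameters are illustrative guesses, not fitted values.
inflated = zip_loglik(counts, pi=0.75, lam=8.0)

print(f"Poisson log-likelihood: {plain:.2f}")
print(f"ZIP log-likelihood:     {inflated:.2f}")
```

Even with hand-picked parameters, the ZIP model assigns this series a much higher likelihood than the best single-rate Poisson, because it accounts for the zeros explicitly instead of averaging them into the rate.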

The company then combines Apache Spark and Stripe’s Rainier to distribute and perform Bayesian modeling, running thousands of simulations with Markov chain Monte Carlo (MCMC) methods to fit the underlying requester patterns. David demonstrates how the parameters of these models offer insights into changes that aren’t easily discerned by eye. Only with hundreds of thousands of simulated and archived traffic patterns, associated with both benign and malicious network traffic, can you begin to unravel how to reduce false alarms and effectively monitor evolving online threats and masquerading malicious traffic.
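As a rough illustration of MCMC fitting for one domain, the sketch below runs a random-walk Metropolis sampler over the two ZIP parameters in plain Python. It does not use Rainier or Spark, and it is not the sampler described in the talk. The priors, step size, synthetic data, and all function names are assumptions chosen for the example.

```python
import math
import random
from collections import Counter


def zip_logpmf(k, pi, lam):
    """Log-probability of count k under a zero-inflated Poisson."""
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) - lam + k * math.log(lam) - math.lgamma(k + 1)


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_zip(rng, n, pi, lam):
    """Draw n synthetic counts: idle with probability pi, else Poisson(lam)."""
    return [0 if rng.random() < pi else sample_poisson(rng, lam) for _ in range(n)]


def log_posterior(a, b, hist):
    """Unnormalized log-posterior in unconstrained space.

    a = logit(pi), b = log(lam). The Normal(0, 2) priors on a and b are an
    assumption made for this sketch.
    """
    pi = 1.0 / (1.0 + math.exp(-a))
    lam = math.exp(b)
    log_prior = -0.5 * (a / 2.0) ** 2 - 0.5 * (b / 2.0) ** 2
    return log_prior + sum(n * zip_logpmf(k, pi, lam) for k, n in hist.items())


def metropolis(counts, n_steps=6000, burn_in=2000, step=0.1, seed=0):
    """Random-walk Metropolis; returns post-burn-in (pi, lam) samples."""
    rng = random.Random(seed)
    hist = Counter(counts)  # group equal counts so each step is cheap
    a, b = 0.0, 0.0
    current = log_posterior(a, b, hist)
    samples = []
    for i in range(n_steps):
        a_new = a + rng.gauss(0.0, step)
        b_new = b + rng.gauss(0.0, step)
        proposed = log_posterior(a_new, b_new, hist)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        if i >= burn_in:
            samples.append((1.0 / (1.0 + math.exp(-a)), math.exp(b)))
    return samples


# Synthetic traffic for one hypothetical domain (true pi=0.7, lam=6.0).
counts = simulate_zip(random.Random(42), 300, pi=0.7, lam=6.0)
samples = metropolis(counts)
pi_mean = sum(p for p, _ in samples) / len(samples)
lam_mean = sum(l for _, l in samples) / len(samples)
print(f"posterior mean pi={pi_mean:.2f}, lam={lam_mean:.2f}")
```

Sampling in the unconstrained `(a, b)` space keeps every proposal valid without rejection at the boundaries. In a distributed setting, a function like `metropolis` would be applied independently to each domain's counts, which is the kind of per-key work Spark parallelizes well.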

Topics include:

  • The latest advances in Bayesian inference on the JVM using Stripe’s open source Rainier project
  • How to scale Bayesian inference to internet-scale datasets using Apache Spark
  • How to build time-dependent risk and severity metrics identifying network anomalies
  • How to model sporadic network traffic using discrete probability distributions
  • How to build hidden Markov models (HMMs) capturing idle and active states of network traffic
  • How to use Markov chain Monte Carlo (MCMC) methods
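
The idle and active states mentioned in the topics above can be sketched as a two-state HMM with Poisson emissions, decoded with the Viterbi algorithm. This is a minimal plain-Python illustration under assumed parameters: the rates, transition probabilities, and counts are hypothetical and are fixed by hand rather than learned.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under Poisson(lam)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def viterbi(counts, rates, log_trans, log_start):
    """Most likely hidden state sequence for a Poisson-emission HMM."""
    n_states = len(rates)
    score = [log_start[s] + poisson_logpmf(counts[0], rates[s])
             for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for k in counts[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + poisson_logpmf(k, rates[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# State 0 = idle (near-zero rate), state 1 = active. All values are assumed.
rates = [0.1, 8.0]
log_trans = [[math.log(0.9), math.log(0.1)],   # idle   -> idle / active
             [math.log(0.3), math.log(0.7)]]   # active -> idle / active
log_start = [math.log(0.5), math.log(0.5)]

counts = [0, 0, 0, 7, 9, 0, 0, 0, 8, 0, 0]
path = viterbi(counts, rates, log_trans, log_start)
print(path)  # [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
```

Unlike the ZIP model, which treats each interval independently, the transition matrix lets the model express that idle intervals tend to follow idle intervals and bursts tend to persist.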

David Rodriguez

Cisco Systems

David Rodriguez is a senior research engineer at Cisco Umbrella (formerly OpenDNS). He has coauthored multiple pending patents with Cisco in distributed machine learning applications centered around deep learning and behavioral analytics. He’s a frequent speaker on machine learning in cybersecurity at conferences including Flink Forward, Black Hat, FloCon, Virus Bulletin, and HitBSEC. David holds an MA in mathematics from San Francisco State University.