Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Detecting and scoring anomalies with calibrated probabilistic models

Alexander Gray (Skytree, Inc.)
2:00pm–2:30pm Tuesday, 03/29/2016
Hardcore Data Science
Location: 210 C/G
Tags: geospatial
Average rating: 3.67 (12 ratings)

Prerequisite knowledge

Attendees should have machine-learning or statistical experience (e.g., master's-level coursework in machine learning).

Description

Many real-world problems, including fraud, intrusion, threat, and defect detection, are typically described within the anomaly-detection framework, which refers to the process of finding instances in the data whose characteristics differ from those of the vast majority of other instances. The problem is inherently difficult: the proportion of anomalous examples is very small, yet those examples have very diverse characteristics compared to the rest of the (normal) data, which makes it hard to build a classifier even in the supervised setting, where anomaly labels are available.

In many cases, however, labels are either sparse or nonexistent, requiring an unsupervised learning approach sometimes referred to as one-class classification. In this case, anomalies can be found by building a probabilistic model of all instances, which is taken to describe normal behavior, and identifying anomalies as instances that are very unlikely under that model. However, small likelihood values do not necessarily correspond to anomalies (in a high-dimensional feature space, for example, almost all instances have low likelihood scores), so an additional calibration step is needed to transform the likelihood scores into anomaly scores.
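
One common way to calibrate, and only one of several (the session description does not fix a specific method), is to rank each instance's log-likelihood against the empirical distribution of log-likelihoods on the training data, so that the anomaly score reads as a tail probability rather than a raw density. The following minimal sketch assumes scikit-learn's KernelDensity; the synthetic data, the bandwidth, and the anomaly_score helper are hypothetical choices for illustration.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)

    # "Normal" training data: a synthetic 2-D feature space for illustration.
    X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))

    # Fit a kernel density estimate of normal behavior.
    kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X_train)

    # Sorted training log-likelihoods define the reference distribution.
    train_scores = np.sort(kde.score_samples(X_train))

    def anomaly_score(X):
        # Calibrate log-likelihoods to anomaly scores in [0, 1]: one minus
        # the empirical CDF of the training log-likelihoods. Instances less
        # likely than almost all training points score close to 1,
        # regardless of the raw likelihood scale.
        log_lik = kde.score_samples(X)
        ecdf = np.searchsorted(train_scores, log_lik) / len(train_scores)
        return 1.0 - ecdf

    # A point near the training mass scores low; a distant point scores ~1.
    print(anomaly_score(np.array([[0.1, -0.2], [6.0, 6.0]])))

The calibrated score has a direct reading as an approximate tail probability, which makes a single alerting threshold meaningful across feature spaces where raw densities would live on wildly different scales.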

Alex Gray describes both the anomaly/outlier/novelty detection setup and commonly used approaches. He concentrates on unsupervised learning, exploring ways to build unsupervised models (with an emphasis on kernel density estimation) and explaining how to calibrate the likelihood scores. He uses a real-world use case, finding outliers in geospatial behavior, to demonstrate how an outlier-detection framework can be applied to find anomalies in a dataset with millions of instances.
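
To make the geospatial use case concrete, here is a self-contained sketch on synthetic latitude/longitude data. It is not the pipeline from the talk; the coordinates, counts, and parameter values are invented for illustration, and it reuses the same fit-then-rank recipe from the previous sketch.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(1)

    # Synthetic geospatial behavior: two activity hotspots in (lat, lon)
    # plus two anomalous locations far from either.
    hotspot_a = rng.normal([37.33, -121.89], 0.05, size=(20_000, 2))
    hotspot_b = rng.normal([37.77, -122.42], 0.05, size=(20_000, 2))
    outliers = np.array([[36.00, -120.00], [38.50, -123.50]])
    X = np.vstack([hotspot_a, hotspot_b, outliers])

    # scikit-learn's KernelDensity is tree-backed; rtol trades a little
    # accuracy for speed, which matters as the instance count grows.
    kde = KernelDensity(bandwidth=0.02, rtol=1e-4).fit(X)
    log_lik = kde.score_samples(X)

    # Report the least likely instances; the two injected outliers should
    # rank at the very bottom.
    least_likely = np.argsort(log_lik)[:5]
    print(X[least_likely])

At millions of instances, exact kernel density estimation becomes the bottleneck; tree-based approximations with accuracy tolerances like the rtol above are one standard way to keep scoring tractable at that scale.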

Alexander Gray

Skytree, Inc.

Alexander Gray is an associate professor at Georgia Tech and the CEO of Skytree, Inc. His research focuses on scaling up all of the major practical methods of machine learning (ML) to massive datasets. Alex began working on this problem at NASA in 1993 (long before the current fashionable talk of big data). His large-scale algorithms helped enable the top scientific breakthrough of 2003 and have won a number of research awards.

Alex served on the National Academy of Sciences Committee on the Analysis of Massive Data and frequently gives invited tutorial lectures on massive-scale ML at top research conferences and agencies. He holds degrees in applied mathematics and computer science from UC Berkeley and a PhD in computer science from Carnegie Mellon.