Many real-world problems, including fraud, intrusion, threat, and defect detection, are typically framed as anomaly detection: finding instances in the data whose characteristics differ from those of the vast majority of other instances. The problem is inherently difficult because anomalous examples make up a very small proportion of the data and are far more diverse than the rest of the (normal) data, which makes it hard to build a classifier even in the supervised setting, where anomaly labels are available.
In many cases, however, labels are either sparse or nonexistent, requiring an unsupervised learning approach, sometimes referred to as one-class classification. In this case, anomalies can be found by building a probabilistic model of all instances, which are taken to correspond to normal behavior, and identifying as anomalies the instances that are very unlikely under that model. However, small likelihood values do not necessarily correspond to anomalies (e.g., the high dimensionality of the feature space can give almost all instances low likelihood scores), so additional calibration steps are needed to transform the likelihood scores into anomaly scores.
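As a rough illustration of the unsupervised approach described above, the following minimal sketch fits a Gaussian kernel density estimate to synthetic "normal" data and then calibrates the raw log-likelihoods into anomaly scores between 0 and 1. The calibration used here (the fraction of reference points with a higher leave-one-out density) is an assumption chosen for simplicity, not necessarily the method presented in the session; the function names, the bandwidth, and the data are likewise hypothetical.

```python
import math
import random


def log_kde(x, data, bandwidth):
    """Log of a Gaussian kernel density estimate at point x (a tuple of floats)."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2 * math.pi) - d * math.log(bandwidth)
    exponents = [
        -0.5 * sum((a - b) ** 2 for a, b in zip(x, p)) / bandwidth ** 2
        for p in data
    ]
    m = max(exponents)
    # Log-sum-exp keeps the computation stable when densities are tiny.
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_norm + log_sum - math.log(len(data))


def anomaly_score(x, data, reference_log_densities, bandwidth):
    """Calibrated score in [0, 1]: the fraction of reference points whose
    density is higher than the density at x (assumed calibration scheme)."""
    lx = log_kde(x, data, bandwidth)
    higher = sum(1 for r in reference_log_densities if r > lx)
    return higher / len(reference_log_densities)


# Synthetic "normal" behavior: 300 points from a 2-D standard Gaussian.
random.seed(0)
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]
bandwidth = 0.5  # hypothetical value; in practice chosen by cross-validation

# Leave-one-out log-densities of the training points serve as the reference
# distribution, so no point contributes to its own density estimate.
reference = [
    log_kde(p, normal[:i] + normal[i + 1:], bandwidth)
    for i, p in enumerate(normal)
]

typical_score = anomaly_score((0.0, 0.0), normal, reference, bandwidth)
outlier_score = anomaly_score((6.0, 6.0), normal, reference, bandwidth)
print(typical_score, outlier_score)
```

Because the score is a rank against the reference densities rather than a raw likelihood, it remains interpretable even when every likelihood is small in absolute terms, which is the situation the calibration step is meant to handle. The brute-force evaluation here is quadratic in the number of points and would not scale to millions of instances without a faster density-estimation algorithm.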
Alex Gray describes both the anomaly/outlier/novelty detection setup and commonly used approaches. Alex focuses on unsupervised learning, exploring ways to build unsupervised models (in particular, kernel density estimation) and explaining how to calibrate the likelihood scores. He uses a real-world use case, finding outliers in geospatial behavior, to demonstrate how an outlier-detection framework can be applied to find anomalies in a dataset with millions of instances.
Alexander Gray is an associate professor at Georgia Tech and the CEO of Skytree, Inc. His research focuses on scaling up all of the major practical methods of machine learning (ML) to massive datasets. Alex began working on this problem at NASA in 1993 (long before the current fashionable talk of big data). His large-scale algorithms helped enable the top scientific breakthrough of 2003 and have won a number of research awards.
Alex served on the National Academy of Sciences Committee on the Analysis of Massive Data and frequently gives invited tutorial lectures on massive-scale ML at top research conferences and agencies. Alexander has degrees in applied mathematics and computer science from UC Berkeley and a PhD in computer science from Carnegie Mellon.
©2016, O'Reilly Media, Inc.