Sep 23–26, 2019

Scalable anomaly detection with Spark and SOS

Jeroen Janssens (Data Science Workshops B.V.)
4:35pm–5:15pm Thursday, September 26, 2019
Location: 1A 08/10

Who is this presentation for?

Data scientists, data engineers




In this talk, we present Stochastic Outlier Selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS employs the concept of affinity to compute an outlier probability for each data point. Compared with alternative approaches, it offers superior performance while being more robust to data perturbations and parameter settings.
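To make the affinity idea concrete, here is a minimal single-machine sketch in Python with NumPy. It is not the speakers' implementation and not the Spark version discussed in the talk. It assumes the commonly described formulation of SOS: each point gets a Gaussian-like affinity to every other point, a per-point bandwidth is tuned by binary search to reach a chosen perplexity, affinities are normalised into binding probabilities, and a point's outlier probability is the probability that no other point binds to it. The function name, the `perplexity`, `tol`, and `max_iter` parameters, the squared Euclidean metric, and the sample data are all illustrative assumptions.

```python
import numpy as np


def sos_outlier_probabilities(X, perplexity=3.0, tol=1e-5, max_iter=200):
    """Return one outlier probability per row of X (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean dissimilarities (an assumed choice of metric).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff ** 2, axis=2)

    target_entropy = np.log(perplexity)
    B = np.zeros((n, n))  # B[i, j]: probability that point i binds to point j
    others = ~np.eye(n, dtype=bool)  # mask that excludes each point itself

    for i in range(n):
        d = D[i, others[i]]
        d = d - d.min()  # shifting leaves the normalised values unchanged
        beta, lo, hi = 1.0, 0.0, np.inf
        for _ in range(max_iter):
            a = np.exp(-beta * d)  # affinities of point i to the others
            b = a / a.sum()        # normalise into binding probabilities
            nz = b > 0
            entropy = -np.sum(b[nz] * np.log(b[nz]))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:  # too spread out: narrow the kernel
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:                         # too peaked: widen the kernel
                hi = beta
                beta = (beta + lo) / 2.0
        B[i, others[i]] = b

    # A point is likely an outlier when no other point binds to it.
    return np.prod(1.0 - B, axis=0)


# Hypothetical data: five nearby points and one point far away (index 5).
X = np.array([[0.0, 0.0], [1.0, 0.2], [0.1, 1.1],
              [1.2, 0.9], [0.5, 0.7], [10.0, 10.0]])
p = sos_outlier_probabilities(X)
print(np.round(p, 3))
```

The sketch builds the full n-by-n dissimilarity matrix in memory, so it only suits small data sets, and it assumes the perplexity is smaller than the number of other points. Distributing that pairwise computation is the kind of problem a Spark implementation has to address.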

SOS was originally developed in MATLAB by the speaker, Jeroen Janssens. Later, to allow for wider adoption by the data science community, the algorithm was ported to both Python and R.

Since 2015, Fokko Driesprong has been implementing SOS on a variety of distributed, large-scale data processing technologies, including Spark’s MLlib and Apache Flink. Most recently, he ported the MLlib implementation to Spark ML Pipelines, which have superseded MLlib and provide a uniform set of high-level APIs built on top of DataFrames.

First, we illustrate the idea and intuition behind SOS. Next, we demonstrate our implementation of SOS on top of ML Pipelines and discuss the process of porting it from MLlib. Finally, we apply SOS to a real-world use case. By the end of this talk, you’ll have a good understanding of the algorithm and of how to integrate anomaly detection into your own (streaming) machine learning pipeline.

Prerequisite knowledge

Some knowledge of Spark is helpful, but not required.

What you'll learn

Learn the concepts behind anomaly detection, and the SOS algorithm in particular, as well as how to apply it in practice.

Jeroen Janssens

Data Science Workshops B.V.

Jeroen Janssens is the founder and CEO of Data Science Workshops B.V., which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.
