Scalable anomaly detection with Spark and SOS
Who is this presentation for?
Data scientists, data engineers
In this talk, we present Stochastic Outlier Selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS employs the concept of affinity to compute an outlier probability for each data point. It has superior performance while being more robust to data perturbations and parameter settings.
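To make the affinity idea concrete, here is a minimal, single-machine sketch in plain Python. It follows the published description of SOS: affinities decay exponentially with squared distance, each point's kernel width is tuned so that its binding distribution reaches a target perplexity, and a point's outlier probability is the product of (1 - binding probability) over all other points. This is an illustration only, not the speakers' Spark implementation. The function names, the sample `POINTS`, and the tolerance and iteration defaults are assumptions made for this example, and the target perplexity should be smaller than the number of points minus one.

```python
import math


def binding_row(sq_dists, beta):
    """Binding probabilities from one point to all others for a given beta."""
    # beta plays the role of 1 / (2 * sigma^2). Shifting by the smallest
    # distance avoids underflow and cancels out in the normalization.
    d_min = min(sq_dists)
    affinities = [math.exp(-(d - d_min) * beta) for d in sq_dists]
    total = sum(affinities)
    return [a / total for a in affinities]


def perplexity(probs):
    """Perplexity 2^H of a distribution (the effective number of neighbors)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_binding_row(sq_dists, target, tol=1e-5, max_iter=100):
    """Binary-search beta so the row's perplexity is close to the target."""
    beta, lo, hi = 1.0, 0.0, math.inf
    row = binding_row(sq_dists, beta)
    for _ in range(max_iter):
        diff = perplexity(row) - target
        if abs(diff) < tol:
            break
        if diff > 0:  # too many effective neighbors: narrow the kernel
            lo = beta
            beta = beta * 2.0 if hi == math.inf else (beta + hi) / 2.0
        else:  # too few effective neighbors: widen the kernel
            hi = beta
            beta = (beta + lo) / 2.0
        row = binding_row(sq_dists, beta)
    return row


def outlier_probabilities(points, target_perplexity):
    """Return one outlier probability per point (hypothetical helper)."""
    n = len(points)
    binding = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        sq_dists = [
            sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            for j in others
        ]
        row = find_binding_row(sq_dists, target_perplexity)
        full = [0.0] * n  # a point never binds to itself
        for j, b in zip(others, row):
            full[j] = b
        binding.append(full)
    # A point is an outlier when no other point binds to it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


# Made-up sample data: five clustered points and one distant point.
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.2), (1.1, 1.0), (0.4, 0.5), (8.0, 8.0)]

if __name__ == "__main__":
    for point, prob in zip(POINTS, outlier_probabilities(POINTS, 3.0)):
        print(point, round(prob, 3))
```

In this sketch the distant point receives almost no binding probability from the clustered points, so its outlier probability comes out close to 1. The quadratic loop over all pairs is what a distributed implementation has to partition across workers.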
SOS was originally developed in MATLAB by speaker Jeroen Janssens. Later, to allow for wider adoption by the data science community, the algorithm was ported to both Python and R.
Since 2015, Fokko Driesprong has been implementing SOS on a variety of distributed, large-scale data processing technologies, including Spark’s MLlib and Apache Flink. Most recently, he ported the MLlib implementation to Spark ML Pipelines, which has superseded the RDD-based MLlib API and provides a uniform set of high-level APIs built on top of DataFrames.
First, we illustrate the idea and intuition behind SOS. Subsequently, we demonstrate our implementation of SOS on top of ML Pipelines and discuss the process of porting it from MLlib. Finally, we apply SOS to a real-world use case. By the end of this talk, you’ll have a good understanding of the algorithm and of how to integrate anomaly detection into your own (streaming) machine learning pipeline.
Prerequisite knowledge
Some knowledge of Spark is helpful, but not required.
What you'll learn
Data Science Workshops B.V.
Jeroen Janssens is the founder and CEO of Data Science Workshops B.V., which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.
View a complete list of Strata Data Conference contacts