Scalable anomaly detection with Spark and SOS





Who is this presentation for?
- Data scientists and data engineers
Level
Description
Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm he originally developed in MATLAB for detecting anomalies in large, high-dimensional data. SOS uses the concept of affinity to compute an outlier probability for each data point. It is reported to perform better than alternative approaches while also being more robust to data perturbations and parameter settings. SOS has since been ported to both Python and R to allow for wider adoption by the data science community.
SOS has been implemented on a variety of distributed, large-scale data processing technologies, including Spark MLlib and Apache Flink. Most recently, the MLlib implementation was ported to Spark ML pipelines, which have superseded MLlib and provide a uniform set of high-level APIs built on top of DataFrames.
Jeroen illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of ML pipelines, explains the process of porting it from MLlib, and applies SOS to a real-world use case. By the end, you’ll have a good understanding of the algorithm and of how to integrate anomaly detection into your own (streaming) machine learning pipeline.
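To make the affinity idea in the description concrete, here is a minimal single-machine sketch in plain Python. It follows the published outline of SOS (dissimilarity, affinity, binding probability, outlier probability) and is not the speaker's MATLAB, Python, R, or Spark code. The function names, the default perplexity, and the toy data are all assumptions made for illustration.

```python
import math

def squared_distances(points):
    """Pairwise squared Euclidean distances (the dissimilarity used in this sketch)."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def binding_probabilities(dist_row, i, perplexity, tol=1e-5, max_iter=100):
    """Row i of the binding matrix: normalized affinities of point i to all others.

    beta = 1 / (2 * sigma_i^2) is tuned by binary search so that the entropy of
    the row matches log(perplexity), i.e. each point has roughly `perplexity`
    effective neighbors.
    """
    # Shifting by the smallest distance avoids underflow and leaves the
    # normalized probabilities unchanged.
    d_min = min(d for j, d in enumerate(dist_row) if j != i)
    target = math.log(perplexity)
    beta, lo, hi = 1.0, 0.0, math.inf
    probs = []
    for _ in range(max_iter):
        # Affinity of point i with itself is defined as zero.
        aff = [0.0 if j == i else math.exp(-(d - d_min) * beta)
               for j, d in enumerate(dist_row)]
        total = sum(aff)
        probs = [a / total for a in aff]
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            # Distribution too flat: narrow the kernel by increasing beta.
            lo = beta
            beta = beta * 2.0 if math.isinf(hi) else (beta + hi) / 2.0
        else:
            # Distribution too peaked: widen the kernel by decreasing beta.
            hi = beta
            beta = (beta + lo) / 2.0
    return probs

def outlier_probabilities(points, perplexity=3.0):
    """Probability that each point is an outlier: no other point binds to it."""
    dist = squared_distances(points)
    n = len(points)
    binding = [binding_probabilities(dist[i], i, perplexity) for i in range(n)]
    return [math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
            for j in range(n)]

# Toy data (made up): five nearby points and one point far away from them.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.15, 0.05), (5.0, 5.0)]
probs = outlier_probabilities(points, perplexity=3.0)
for point, prob in zip(points, probs):
    print(point, round(prob, 3))
```

In this sketch the isolated point ends up with the highest outlier probability because none of the clustered points assigns it a meaningful binding probability. The pairwise distance matrix grows quadratically with the number of points, which is one reason a distributed implementation is attractive for large datasets.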
Prerequisite knowledge
- General knowledge of Spark (useful but not required)
What you'll learn
- The concepts behind anomaly detection and the SOS algorithm, as well as how to apply SOS in practice

Jeroen Janssens
Data Science Workshops
Jeroen Janssens is the founder, CEO, and an instructor of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He’s the author of Data Science at the Command Line (O’Reilly). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.