Sep 23–26, 2019

Scalable anomaly detection with Spark and SOS

Jeroen Janssens (Data Science Workshops)
4:35pm–5:15pm Thursday, September 26, 2019
Location: 1A 08/10

Who is this presentation for?

  • Data scientists and data engineers

Level

Intermediate

Description

Jeroen Janssens dives into Stochastic Outlier Selection (SOS), an unsupervised algorithm that he developed in MATLAB for detecting anomalies in large, high-dimensional data. SOS uses the concept of affinity to compute an outlier probability for each data point. Compared with other outlier-detection algorithms, it offers superior performance and is more robust to data perturbations and parameter settings. SOS has been ported to both Python and R to allow wider adoption by the data science community.
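
To make the affinity idea concrete, here is a minimal pure-Python sketch; it is an illustration, not the speaker's implementation. It assumes a Gaussian affinity with one fixed bandwidth shared by all points, whereas SOS as published tunes a variance per data point from a perplexity parameter. The function name `outlier_probabilities`, the `bandwidth` parameter, and the sample points are all hypothetical.

```python
import math


def outlier_probabilities(points, bandwidth=1.0):
    """Simplified affinity-based outlier probabilities (fixed bandwidth).

    Assumption: one shared bandwidth replaces the per-point,
    perplexity-tuned variance used by the published algorithm.
    """
    n = len(points)

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Affinity that point i has with point j; a point has none with itself.
    affinity = [
        [
            0.0 if i == j
            else math.exp(-squared_distance(points[i], points[j])
                          / (2.0 * bandwidth ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]

    # Binding probabilities: normalise each row so it sums to 1.
    binding = []
    for row in affinity:
        total = sum(row)
        binding.append([a / total if total > 0.0 else 0.0 for a in row])

    # Point j is an outlier when no other point binds to it:
    # p_j = product over i != j of (1 - binding[i][j]).
    probabilities = []
    for j in range(n):
        p = 1.0
        for i in range(n):
            if i != j:
                p *= 1.0 - binding[i][j]
        probabilities.append(p)
    return probabilities


# Hypothetical data: a tight cluster of four points plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
probs = outlier_probabilities(points)
for point, prob in zip(points, probs):
    print(point, round(prob, 3))
```

In this sketch the distant point receives an outlier probability close to 1, because the clustered points bind almost exclusively to each other, while each clustered point scores well below 0.5.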

SOS has been implemented on a variety of distributed, large-scale data processing technologies, including Spark MLlib and Apache Flink. Most recently, the MLlib implementation was ported to Spark ML pipelines, which have superseded the RDD-based MLlib API and provide a uniform set of high-level APIs built on top of DataFrames.

Jeroen illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of ML pipelines, explains the process of porting it from MLlib, and applies SOS to a real-world use case. By the end, you’ll have a good understanding of the algorithm and how to integrate anomaly detection into your own (streaming) machine learning pipeline.

Prerequisite knowledge

  • General knowledge of Spark (useful but not required)

What you'll learn

  • Learn the concepts behind anomaly detection and the SOS algorithm as well as how to apply it in practice

Jeroen Janssens

Data Science Workshops

Jeroen Janssens is the founder, CEO, and an instructor of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He’s the author of Data Science at the Command Line (O’Reilly). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

