Make Data Work
Oct 15–17, 2014 • New York, NY

A Gentle Introduction to Apache Spark and Clustering for Anomaly Detection

Sean Owen (Cloudera)
5:05pm–5:45pm Thursday, 10/16/2014
Hadoop & Beyond
Location: 1 E20/1 E21
Average rating: 4.73 (11 ratings)

There has been an explosion of interest in Apache Spark as a new, alternative computing paradigm for Hadoop. It offers something to interest data scientists of all stripes, from an interactive REPL to distributed functional programming to implementations of standard machine learning techniques.

In fact, it promises big scalability improvements over MapReduce for iterative algorithms such as k-means clustering, which can be used, for example, to detect anomalous data in a huge data set.

This session will walk through a complete example of anomaly detection using Apache Spark and its MLlib subproject, as applied to the well-known network intrusion detection data set from KDD Cup ’99. It will impart a taste of Scala (Spark’s native language), Spark’s core concepts like RDDs, and usage of MLlib for k-means clustering, in real time on a Hadoop cluster. It will also introduce the concept of k-means clustering and how a data scientist would iteratively improve clustering in a session with Spark.
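
As a rough illustration of the idea behind clustering-based anomaly detection, the sketch below fits k-means to a small data set and scores each point by its distance to the nearest centroid, so that points far from every cluster stand out. It is not the session's code: the session uses Scala and MLlib on Spark, while this is plain, single-machine Python with no Spark dependency. The function names (`kmeans`, `anomaly_score`), the toy data, and the deterministic choice of starting centroids are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    # Assumption for reproducibility: start from evenly spaced input points
    # rather than a random or smarter initialization.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids


def anomaly_score(point, centroids):
    """Distance from a point to its nearest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical toy data: two tight groups of points plus one far-away point.
group_a = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
group_b = [(10 + i * 0.1, 10 + j * 0.1) for i in range(5) for j in range(5)]
outlier = (5.0, -8.0)
data = group_a + group_b + [outlier]

centroids = kmeans(data, k=2)
most_anomalous = max(data, key=lambda p: anomaly_score(p, centroids))
print(most_anomalous)
```

On a real data set the interesting work lies in choosing k, normalizing features, and picking a score threshold, which is the kind of iterative refinement the session describes.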

No prior knowledge of these subjects is required, although the session is intended for a curious technical audience.


Sean Owen


Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Mahout. Myrrix is now part of Cloudera. Sean was primary author of recommender components in Apache Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.