A Gentle Introduction to Apache Spark and Clustering for Anomaly Detection

Sean Owen (Cloudera)
Data Science
Location: 113
Average rating: 4.00 (20 ratings)

There has been an explosion of interest in Apache Spark as a new, alternative computing paradigm for Hadoop. It offers something to interest data scientists of all stripes, from an interactive REPL, to distributed functional programming, to implementations of standard machine learning techniques.

In particular, it promises big scalability improvements over MapReduce for iterative algorithms such as k-means clustering, which can be used, for example, to detect anomalous data in a huge data set.

This session will walk through a complete example of anomaly detection using Apache Spark and its MLlib subproject, as applied to the well-known network intrusion detection data set from KDD Cup ’99. It will impart a taste of Scala (Spark’s native language), Spark’s core concepts like RDDs, and usage of MLlib for k-means clustering, in real time on a Hadoop cluster. It will also introduce the concept of k-means clustering and how a data scientist would iteratively improve clustering in a session with Spark.
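
To give a rough feel for the idea of clustering-based anomaly detection, here is a minimal sketch in plain Python. It is not the session's Spark/Scala code and does not use MLlib. The function names (`kmeans`, `nearest`, `anomalies`), the evenly spaced initialization, and the toy data are all assumptions made for illustration. The sketch fits centroids with Lloyd-style iterations, then ranks points by distance to their nearest centroid, treating the farthest points as candidate anomalies.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    best = min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))
    return best, math.sqrt(squared_distance(point, centroids[best]))


def kmeans(points, k, iterations=20):
    """Fit k centroids with a fixed number of Lloyd-style iterations.

    Assumption for this sketch: initial centroids are taken at evenly
    spaced positions in the input, so the result is deterministic.
    Real implementations usually use a randomized initialization.
    """
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids


def anomalies(points, centroids, top_n):
    """Return the top_n points farthest from their nearest centroid."""
    return sorted(points,
                  key=lambda p: nearest(p, centroids)[1],
                  reverse=True)[:top_n]


# Made-up toy data: two tight groups plus one point far from both.
sample = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (30.0, 5.0),
]
centroids = kmeans(sample, k=2)
print(anomalies(sample, centroids, top_n=1))
```

The choice of k matters a great deal here: with too many clusters, an outlier can end up as the sole member of its own cluster and score a distance of zero, which is one reason clustering is usually tuned iteratively.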

Sean Owen

Cloudera

Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, which produced a real-time recommender and clustering product that evolved from Mahout. Myrrix is now part of Cloudera. Sean was the primary author of the recommender components in Apache Mahout, and has been an active committer and PMC member for the project. He is a co-author of Mahout in Action.