Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Malicious site detection with large-scale belief propagation

Alexander Ulanov (Hewlett Packard Labs), Manish Marwah (Hewlett Packard Labs)
2:40pm–3:20pm Thursday, March 16, 2017
Platform Security and Cybersecurity, Spark & beyond
Location: LL21 C/D
Level: Advanced
Secondary topics: Hardcore Data Science

Who is this presentation for?

  • Data scientists and big data engineers

Prerequisite knowledge

  • A basic understanding of Apache Spark

What you'll learn

  • Understand the challenges in implementing large-scale graph analytics algorithms
  • Learn how to solve common challenges for implementing belief propagation
  • Discover how to use graph analytics to detect malicious websites in a large web graph
  • Explore a new BP Spark package


A lot of real-world data can be naturally modeled as a graph (e.g., computer network interactions, social networks, and transactions between bank accounts). Graphs provide a rich and powerful representation that captures interactions, dependencies, and/or similarity between entities.

Applications of graphs include detecting malicious websites and fraudulent transactions. Such graphs can run to billions of vertices, and their size and power-law nature make real-time or near-real-time analytics challenging; without timely results, the analysis is often worthless. Because storing and processing a large-scale graph requires a large amount of memory, the graph must be partitioned and distributed across a cluster, which leads to intensive communication between cluster nodes.
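The link between partitioning and communication can be illustrated with a small sketch. It counts the edges whose endpoints land on different partitions, a rough proxy for cross-node traffic. The toy graph, the two assignments, and the `cut_edges` helper are hypothetical and chosen only for illustration; this is not the partitioning strategy used in the talk, and GraphX itself partitions edges rather than vertices.

```python
def cut_edges(edges, partition_of):
    """Count edges whose endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if partition_of[u] != partition_of[v])


# Hypothetical toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

# Assignment A: vertices split by parity, ignoring graph structure.
by_parity = {v: v % 2 for v in range(6)}

# Assignment B: each triangle is kept together on one partition.
by_cluster = {v: 0 if v < 3 else 1 for v in range(6)}

print("cut edges, parity split: ", cut_edges(edges, by_parity))
print("cut edges, cluster split:", cut_edges(edges, by_cluster))
```

An assignment that follows the structure of the graph cuts fewer edges, so fewer messages have to cross the network in each iteration of a message-passing algorithm.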

Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP), a widely used algorithm for performing inference on probabilistic graphical models, on top of Apache Spark GraphX. Applications of BP include fraud detection, malware detection, computer vision, and customer retention. To handle large-scale graphs, Alexander and Manish leverage a number of strategies:

  • They build on top of Apache Spark GraphX.
  • They use an efficient graph-partitioning strategy to reduce communication overhead.
  • They use efficient memory management.
  • They employ shared memory for high-speed communication.

To evaluate performance and demonstrate the effectiveness of the approach, Alexander and Manish model real, large-scale hyperlinked web-crawl data as a graphical model and apply the BP algorithm to infer, in near real time, the probability that each website is malicious.
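As a rough illustration of the idea, the sketch below runs loopy belief propagation on a tiny hand-made link graph. It is a single-machine sketch in plain Python, not the Spark GraphX implementation described in the talk. The function name `loopy_bp`, the `homophily` parameter (how strongly linked sites are assumed to share a label), the prior values, and the example graph are all assumptions made for the example.

```python
from collections import defaultdict


def loopy_bp(edges, priors, homophily=0.8, iterations=20):
    """Return an estimated P(malicious) per vertex for a binary pairwise model.

    edges:  iterable of (u, v) links, treated as undirected in this sketch
    priors: dict vertex -> prior P(malicious); unlisted vertices default to 0.5
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # psi[a][b]: compatibility of neighboring states (0 = benign, 1 = malicious).
    psi = [[homophily, 1 - homophily], [1 - homophily, homophily]]
    # phi[v]: prior over (benign, malicious).
    phi = {v: (1 - priors.get(v, 0.5), priors.get(v, 0.5)) for v in neighbors}
    # messages[(i, j)]: message from i to j, initialised to uniform.
    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            out = [0.0, 0.0]
            for xj in (0, 1):
                for xi in (0, 1):
                    # Product of messages into i from every neighbor except j.
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= messages[(k, i)][xi]
                    out[xj] += phi[i][xi] * psi[xi][xj] * incoming
            total = out[0] + out[1]
            updated[(i, j)] = (out[0] / total, out[1] / total)
        messages = updated  # synchronous update of all messages

    beliefs = {}
    for v in neighbors:
        b = [phi[v][0], phi[v][1]]
        for k in neighbors[v]:
            b[0] *= messages[(k, v)][0]
            b[1] *= messages[(k, v)][1]
        beliefs[v] = b[1] / (b[0] + b[1])
    return beliefs


# Hypothetical example: "a" is a known bad site, "d" a known good one.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
beliefs = loopy_bp(links, {"a": 0.95, "d": 0.05})
for site in sorted(beliefs):
    print(site, round(beliefs[site], 3))
```

On graphs with cycles, such as this one, the result is an approximation of the true marginals; on a tree the same message updates give exact marginals. Each message depends only on messages arriving at its source vertex, which is what makes the algorithm a natural fit for vertex-centric graph frameworks.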

They are planning to open source their implementation as a Spark package.

Alexander Ulanov

Hewlett Packard Labs

Alexander Ulanov is a senior researcher at Hewlett Packard Labs, where his research focuses on large-scale machine learning. Currently, Alexander works on deep learning and graphical models. He has made several contributions to Apache Spark; in particular, he implemented the multilayer perceptron classifier. Previously, he worked on text mining, classification, and recommender systems, and on their real-world applications. Alexander holds a PhD in mathematical modeling from the Russian Academy of Sciences.

Manish Marwah

Hewlett Packard Labs

Manish Marwah is a senior research scientist at Hewlett Packard Labs. His main research interests are in the broad area of data science and its applications to cyber-physical systems, such as smart buildings and data centers. In particular, his research has focused on designing data mining methods for sustainability and energy management. Recently, he has been working on large-scale analytics and its applications to the IoT and security domains. His research has led to over 60 refereed papers, several of which have won awards, including at KDD 2009, IGCC 2011, and AAAI 2013. He has been granted 35 patents. Manish holds a PhD in computer science from the University of Colorado, Boulder and a BTech from the Indian Institute of Technology, Delhi.