A lot of real-world data can be naturally modeled as a graph (e.g., computer network interactions, social networks, and transactions between bank accounts). Graphs provide a rich and powerful representation that captures interactions, dependencies, and/or similarity between entities.
Applications of graphs include detecting malicious websites and fraudulent transactions. Such graphs may exceed billions of vertices, and their size and power-law nature make real-time or near-real-time analytics challenging; in many cases, results that arrive too late are of little use. Because storing and processing a large-scale graph requires large amounts of memory, the graph needs to be partitioned and distributed across a cluster, which leads to intensive communication between cluster nodes.
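To make the communication cost concrete, here is a minimal sketch that counts how many edges cross partition boundaries for a single high-degree hub, the kind of vertex a power-law graph tends to contain. It assumes integer vertex IDs assigned to partitions by simple modulo hashing; this is an illustrative simplification, not the partitioning scheme GraphX or the speakers use, and the name cut_edges is hypothetical.

```python
def cut_edges(edges, num_partitions):
    """Count edges whose endpoints land in different partitions
    when integer vertex IDs are assigned by modulo hashing (an assumption)."""
    return sum(
        1 for u, v in edges
        if u % num_partitions != v % num_partitions
    )


# A star graph: hub vertex 0 linked to leaves 1..12.
star = [(0, leaf) for leaf in range(1, 13)]

crossing = cut_edges(star, num_partitions=4)
print(crossing, "of", len(star), "edges cross partitions")  # 9 of 12
```

Every edge touches the hub, so only the leaves that happen to share the hub's partition avoid network traffic. Each crossing edge implies a message between cluster nodes on every iteration of an algorithm that passes values along edges.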
Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP), a widely used algorithm for performing inference on probabilistic graphical models, on top of Apache Spark GraphX. Applications of BP include fraud detection, malware detection, computer vision, and customer retention. To handle large-scale graphs, Alexander and Manish leverage a number of strategies.
To evaluate performance and demonstrate the effectiveness of the approach, Alexander and Manish model real, large-scale hyperlinked web-crawl data as a graphical model and apply the BP algorithm to infer, in near real time, the probability that each website is malicious.
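For readers unfamiliar with BP, the following is a minimal single-machine sketch of sum-product loopy belief propagation on a pairwise model with two states per vertex (0 = benign, 1 = malicious). It is not the speakers' GraphX implementation. The toy graph, the priors, the edge potential psi, and the function name loopy_bp are all assumptions chosen for illustration.

```python
def loopy_bp(edges, priors, psi, max_iters=50, tol=1e-6):
    """Sum-product loopy BP on a pairwise binary Markov random field.

    edges  : list of undirected (u, v) pairs (assumed to form a simple graph)
    priors : dict node -> [p(state 0), p(state 1)]
    psi    : 2x2 symmetric edge potential, psi[x_u][x_v]
    Returns dict node -> normalized belief [b(0), b(1)].
    """
    neighbors = {n: [] for n in priors}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # One message per directed edge, initialized to uniform.
    msgs = {}
    for u, v in edges:
        msgs[(u, v)] = [0.5, 0.5]
        msgs[(v, u)] = [0.5, 0.5]

    for _ in range(max_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = [0.0, 0.0]
            for xi in (0, 1):
                # Prior of i times all incoming messages except the one from j.
                prod = priors[i][xi]
                for k in neighbors[i]:
                    if k != j:
                        prod *= msgs[(k, i)][xi]
                for xj in (0, 1):
                    out[xj] += prod * psi[xi][xj]
            total = out[0] + out[1]
            new_msgs[(i, j)] = [out[0] / total, out[1] / total]
        # Largest change in any message entry since the previous sweep.
        delta = max(abs(new_msgs[e][s] - msgs[e][s])
                    for e in msgs for s in (0, 1))
        msgs = new_msgs
        if delta < tol:
            break

    # Belief = prior times all incoming messages, normalized.
    beliefs = {}
    for i in priors:
        b = list(priors[i])
        for k in neighbors[i]:
            b = [b[s] * msgs[(k, i)][s] for s in (0, 1)]
        total = b[0] + b[1]
        beliefs[i] = [b[0] / total, b[1] / total]
    return beliefs


# Hypothetical toy web graph: site "a" is strongly suspected; others unknown.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
priors = {"a": [0.1, 0.9], "b": [0.5, 0.5], "c": [0.5, 0.5], "d": [0.5, 0.5]}
psi = [[0.7, 0.3], [0.3, 0.7]]  # assumed homophily: linked sites tend to agree

beliefs = loopy_bp(edges, priors, psi)
for site in sorted(beliefs):
    print(site, round(beliefs[site][1], 3))
```

In this sketch, suspicion spreads from the flagged site to its neighbors and weakens with distance. Each sweep recomputes every message from the previous sweep's values, which is the kind of per-edge, per-iteration work that a distributed graph engine has to parallelize.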
They are planning to open source their implementation as a Spark package.
Alexander Ulanov is a senior researcher at Hewlett Packard Labs, where he focuses his research on machine learning on a large scale. Currently, Alexander works on deep learning and graphical models. He has made several contributions to Apache Spark; in particular, he implemented the multilayer perceptron classifier. Previously, he worked on text mining, classification and recommender systems, and their real-world applications. Alexander holds a PhD in mathematical modeling from the Russian Academy of Sciences.
Manish Marwah is a senior research scientist at Hewlett Packard Labs. His main research interests are in the broad area of data science, and its applications to cyber-physical systems, such as smart buildings and data centers. In particular, his research has focused on designing data mining methods for sustainability and energy management. Recently, he has been looking at large-scale analytics and its applications to IoT and security domains. His research has led to over 60 refereed papers, several of which have won awards, including at KDD 2009, IGCC 2011, and AAAI 2013. He has been granted 35 patents. Manish holds a PhD in computer science from the University of Colorado, Boulder and a BTech from the Indian Institute of Technology, Delhi.