July 20–24, 2015
Portland, OR

Scalable graph analysis with Apache Giraph and Spark GraphX

Roman Shaposhnik (Pivotal Inc.)
2:30pm–3:10pm Thursday, 07/23/2015
Data Portland 252

Prerequisite Knowledge

Basic knowledge of the Java programming language.


Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.

Consider two of the most common examples: social graphs, and the relationships of customers to each other and to the products they purchase. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing would increase dramatically, and the kinds of algorithms you would typically want to run would change as well.

Even if your batch-oriented Hadoop approach with MapReduce works reasonably well for other workloads, scalable graph processing requires an in-memory, exploratory, and iterative approach. One of the best ways to tame this complexity is the bulk synchronous parallel (BSP) model. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook) and GraphX (part of the Apache Spark project).
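To give a feel for the bulk synchronous parallel style, here is a minimal single-machine sketch in plain Java. It is an illustration only and does not use the Giraph or GraphX APIs: the class name `BspComponents` and the method `minLabels` are hypothetical, and the example assumes an undirected graph given as adjacency lists with each edge listed in both directions. Each pass of the loop plays the role of one superstep: every vertex reads the messages sent to it in the previous superstep, keeps the smallest vertex id it has seen, and notifies its neighbors only if its value changed. The computation stops when a superstep produces no messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Giraph or GraphX.
final class BspComponents {

    // For each vertex, returns the smallest vertex id in its connected component.
    // Assumption: adjacency[v] lists the neighbors of v, with every edge present in both directions.
    static int[] minLabels(int[][] adjacency) {
        int n = adjacency.length;
        int[] label = new int[n];
        List<List<Integer>> inbox = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            label[v] = v; // every vertex starts with its own id
            inbox.add(new ArrayList<>());
        }

        // Superstep 0: every vertex sends its id to all of its neighbors.
        for (int v = 0; v < n; v++) {
            for (int w : adjacency[v]) {
                inbox.get(w).add(label[v]);
            }
        }

        boolean anyMessages = true;
        while (anyMessages) {
            // Barrier: messages written here are only read in the next superstep.
            List<List<Integer>> outbox = new ArrayList<>();
            for (int v = 0; v < n; v++) {
                outbox.add(new ArrayList<>());
            }
            anyMessages = false;

            for (int v = 0; v < n; v++) {
                int best = label[v];
                for (int message : inbox.get(v)) {
                    best = Math.min(best, message);
                }
                if (best < label[v]) {
                    // Value improved: update and tell the neighbors.
                    label[v] = best;
                    for (int w : adjacency[v]) {
                        outbox.get(w).add(best);
                        anyMessages = true;
                    }
                }
                // Otherwise the vertex stays quiet, the analogue of voting to halt.
            }
            inbox = outbox;
        }
        return label;
    }

    public static void main(String[] args) {
        // Two components: 0-1-2 and 3-4.
        int[][] graph = { {1}, {0, 2}, {1}, {4}, {3} };
        System.out.println(Arrays.toString(minLabels(graph))); // prints [0, 0, 0, 3, 3]
    }
}
```

In a distributed system the inner loop over vertices runs in parallel across workers and the inbox/outbox swap becomes a network barrier, but the shape of the per-vertex logic stays the same.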

In this talk we will focus on practical advice: how to get up and running with Apache Giraph and GraphX, how to start analyzing simple datasets with built-in algorithms, and how to implement your own graph processing applications using the APIs provided by the projects. Finally, we will compare and contrast the two, and try to lay out some principles for when to use one vs. the other.


Roman Shaposhnik

Pivotal Inc.

Roman Shaposhnik worked at Sun Microsystems for 11 years until it was sold to Oracle. Since then he’s been spending his time on big data and cloud computing projects, working for a number of companies including Huawei, Yahoo!, Cloudera, and now Pivotal. By day, he is a jack-of-all-trades in Hadoop and its ecosystem projects at Pivotal. By night, he is an open source hacker, ASF IPMC member, and VP of Apache Incubator. Most of his spare time is consumed by finishing his Giraph in Action book, which is supposed to come out in fall 2014. Roman is a graduate of St. Petersburg State University. He lives in California.