Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Code Property Graph: A modern, queryable data storage for source code

Fabian Yamaguchi (ShiftLeft)
17:25–18:05 Wednesday, 23 May 2018
Data science and machine learning
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Security and Privacy, Time Series and Graphs
Average rating: 4.33 (3 ratings)

Who is this presentation for?

  • Data analysts, security and performance engineers, and developers

Prerequisite knowledge

  • A basic understanding of at least one programming language and graph theory

What you'll learn

  • Explore techniques for large-scale graph storage in graph databases, recent advances in querying such high-volume data, and ways to use a graph representation of application code


Modern software and its development process have become tremendously complex. Systems are now built in polyglot environments with multiple dependencies and large codebases. As code size increases and new contributors join, it is imperative for developers to have an in-depth understanding of the code itself.

Fabian Yamaguchi offers an overview of the Code Property Graph (CPG), an approach that presents code as a queryable collection of data that developers can interact with and ask relevant questions of, much like a search engine. The CPG represents the functional elements of code, such as variables and methods, in an interconnected graph of data and control flows. Think of it as Facebook’s graph search, but with functions and variables as your friends. This representation lets semantic information about code be stored scalably in distributed graph databases over the web while remaining rapidly accessible.
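
To make the idea concrete, here is a minimal sketch of code stored as a property graph and then queried. It is an illustration only, not the CPG schema or any ShiftLeft API: the PropertyGraph class, the node kinds (METHOD, VARIABLE, CALL), the edge labels (CONTAINS, DATA_FLOW), and the callers_of helper are all hypothetical names chosen for this example. The graph models a single made-up method, handler, that passes a variable into a call to run.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy in-memory property graph (hypothetical, for illustration only)."""
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def find(self, **props):
        """Return ids of nodes whose properties match all given key/value pairs."""
        return [nid for nid, p in self.nodes.items()
                if all(p.get(k) == v for k, v in props.items())]

    def out(self, node_id, label):
        """Return targets of edges with the given label leaving node_id."""
        return [dst for src, lbl, dst in self.edges
                if src == node_id and lbl == label]


# Graph for a made-up method:  def handler(): name = ...; run(name)
g = PropertyGraph()
g.add_node("m1", kind="METHOD", name="handler")
g.add_node("v1", kind="VARIABLE", name="name")
g.add_node("c1", kind="CALL", name="run")
g.add_edge("m1", "CONTAINS", "v1")   # structure: the method contains the variable
g.add_edge("m1", "CONTAINS", "c1")   # structure: the method contains the call
g.add_edge("v1", "DATA_FLOW", "c1")  # data flow: the variable reaches the call


def callers_of(graph, callee):
    """Names of methods that contain a call to `callee`."""
    calls = set(graph.find(kind="CALL", name=callee))
    return sorted({graph.nodes[src]["name"]
                   for src, label, dst in graph.edges
                   if label == "CONTAINS" and dst in calls})


print(callers_of(g, "run"))
```

A question such as "which methods call run?" becomes a lookup over nodes and edges rather than a text search. A production system would presumably keep the same kind of data in a graph database instead of Python lists.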

The CPG-based data structure for storing code makes it possible to identify associations between functions and data and to query them to find known bugs and issues. This queryable representation of an entire codebase also allows developers to identify severe security and performance regressions before they reach production environments. It gives developers the insight to explore the code and quickly find solutions that would otherwise take a large amount of their time. The data stored in large CPGs can also be mined by automated analyses that bring out associations between different code segments.
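
As a rough illustration of querying function and data associations to find a known class of bug, the sketch below searches data-flow edges for paths from an untrusted source to a sensitive sink that bypass a sanitizer. This is a generic breadth-first search over a hand-written edge map, not the query language or analysis presented in the talk. The node names (read_request, exec_sql, and so on) and the unsanitized_paths helper are assumptions made up for the example.

```python
from collections import deque

# Hypothetical data-flow edges: each key flows into the listed targets.
DATA_FLOW = {
    "read_request": ["user_input"],
    "user_input": ["query_string", "escaped"],
    "escaped": ["safe_query"],
    "query_string": ["exec_sql"],
    "safe_query": ["exec_sql"],
}

# Nodes assumed to neutralise untrusted data.
SANITIZERS = {"escaped"}


def unsanitized_paths(flows, source, sink, sanitizers):
    """Return every data-flow path from source to sink that avoids all sanitizers."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            paths.append(path)
            continue
        for nxt in flows.get(node, []):
            # Skip sanitized flows, and skip nodes already on the path (cycles).
            if nxt in sanitizers or nxt in path:
                continue
            queue.append(path + [nxt])
    return paths


for p in unsanitized_paths(DATA_FLOW, "read_request", "exec_sql", SANITIZERS):
    print(" -> ".join(p))
```

Here the only reported path is the one that reaches exec_sql without passing through escaped, which is the kind of finding a developer would want to see before the code ships.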

Topics include:

  • Strategies, pitfalls, and alternate representations of code
  • Internals of the CPG data storage representation
  • Scaling up the CPG for distributed storage
  • Generating large-scale queries on CPG representation of code
  • Using CPG to build real-life tools

Fabian Yamaguchi


Fabian Yamaguchi is the chief scientist at ShiftLeft. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher focusing on manual and automated vulnerability discovery. He has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. Fabian is a frequent speaker at major industry conferences such as Black Hat USA, DEF CON, FIRST, and CCC and at renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master’s degree in computer engineering from Technical University Berlin and a PhD in computer science from the University of Goettingen.