Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Code Property Graph: A modern, queryable data storage for source code

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
4:20pm–5:00pm Wednesday, March 7, 2018
Secondary topics: Graphs and Time-series

Who is this presentation for?

  • Data analysts, security and performance engineers, and developers

Prerequisite knowledge

  • A basic understanding of databases and the cloud, at least one programming language, and graph theory

What you'll learn

  • Explore techniques for storing large graphs in graph databases, recent advances in querying such high-volume data, and the use of graph representations of application code


Modern software and its development process have become tremendously complex. Systems are now built on polyglot environments with multiple dependencies and large code bases. As code size increases and new contributors join, it is imperative for developers to have an in-depth understanding of the code itself. Vlad Ionescu and Fabian Yamaguchi outline the Code Property Graph (CPG), an approach that presents code as a queryable collection of data that a developer can interact with and ask relevant questions of, much like a search engine. The CPG represents the functional elements of code, such as variables and methods, in an interconnected graph of data and control flows (think of it like Facebook’s graph search, but the functions and variables are now your friends). This allows semantic information about code to be stored scalably in distributed graph databases over the web while remaining quickly accessible.
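As a rough illustration of the idea of code as a queryable graph, the following minimal sketch models methods and variables as nodes with properties and connects them with labeled edges. The class name `PropertyGraph`, the edge labels (`CALLS`, `DECLARES`, `FLOWS_TO`), and the example method names are all hypothetical choices made for this sketch; they are not taken from the speakers' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy in-memory property graph (illustrative only)."""
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def out(self, node_id, label):
        """Return targets reachable from node_id over edges with this label."""
        return [dst for src, lbl, dst in self.edges
                if src == node_id and lbl == label]


def build_example():
    """Build a tiny graph for two hypothetical methods and one variable."""
    g = PropertyGraph()
    g.add_node("m1", kind="METHOD", name="handle_request")
    g.add_node("m2", kind="METHOD", name="run_query")
    g.add_node("v1", kind="VARIABLE", name="user_input")
    g.add_edge("m1", "CALLS", "m2")      # control relationship
    g.add_edge("m1", "DECLARES", "v1")   # structural relationship
    g.add_edge("v1", "FLOWS_TO", "m2")   # data-flow relationship
    return g


def callers_of(g, method_name):
    """Query: names of all methods that call the given method."""
    targets = {nid for nid, props in g.nodes.items()
               if props.get("kind") == "METHOD"
               and props.get("name") == method_name}
    return sorted(g.nodes[src]["name"] for src, lbl, dst in g.edges
                  if lbl == "CALLS" and dst in targets)
```

Calling `callers_of(build_example(), "run_query")` asks the graph which methods call `run_query`, the kind of question the abstract compares to a search-engine query. A real CPG would live in a graph database, not in Python lists.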

The CPG-based data structure for storing code allows us to identify associations between functions and data and to query them to find known bugs and issues. This queryable representation of an entire codebase also allows developers to identify severe security and performance regressions before they hit production environments, and helps them explore the code and quickly find solutions that would otherwise have taken a large amount of their time. The data stored in large CPGs can also be mined for automated analysis, which brings out associations between different code segments.
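One way to picture a query for a known bug pattern is a data-flow search: find every path from an untrusted source to a sensitive sink that does not pass through a sanitizer. The sketch below assumes a hand-written flow map; the names in `FLOWS` (`read_param`, `escape`, `execute`, and so on) and the function `unsanitized_paths` are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical data-flow edges: each key passes data to the listed targets.
FLOWS = {
    "read_param": ["build_sql", "escape"],
    "escape": ["build_safe_sql"],
    "build_sql": ["execute"],
    "build_safe_sql": ["execute_safe"],
}


def unsanitized_paths(flows, source, sink, sanitizers):
    """Breadth-first search for source-to-sink paths that avoid sanitizers."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            paths.append(path)
            continue
        for nxt in flows.get(node, []):
            # Skip sanitized branches and avoid revisiting nodes on this path.
            if nxt in sanitizers or nxt in path:
                continue
            queue.append(path + [nxt])
    return paths
```

With this flow map, `unsanitized_paths(FLOWS, "read_param", "execute", {"escape"})` reports the single path through `build_sql`, while the same query for `execute_safe` returns no paths because that sink is only reachable through the sanitizer.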

Topics include:

  • Strategies, pitfalls, and alternate representations of code
  • Internals of the CPG data storage representation
  • Scaling up the CPG for distributed storage
  • Generating large-scale queries on CPG representation of code
  • Using CPG to build real-life tools

Vlad A Ionescu


Vlad A. Ionescu is the founder and chief architect of ShiftLeft. Vlad is the creator of the industry’s first open source lambda framework. Previously, he worked at Google and VMware as an infrastructure engineer. Vlad is the coauthor of RabbitMQ’s Erlang client.


Fabian Yamaguchi


Fabian Yamaguchi is the chief scientist at ShiftLeft. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher focusing on manual and automated vulnerability discovery. He has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. Fabian is a frequent speaker at major industry conferences such as Black Hat USA, DEF CON, FIRST, and CCC and at renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master’s degree in computer engineering from Technical University Berlin and a PhD in computer science from the University of Goettingen.