Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Disrupting data discovery

Mark Grover (Lyft)
14:05–14:45 Wednesday, 1 May 2019
Average rating: 4.64 (11 ratings)

Who is this presentation for?

Architects, data engineers, software developers, and managers



Prerequisite knowledge

A basic understanding of analysis/data science workflow

What you'll learn

Learn how to improve the productivity of your data scientists through faster data discovery


Before any analysis can begin, a data scientist needs to discover the right data sources to analyze, understand them, and gain trust in them. Unfortunately, data discovery is very inefficient today. Countless hours are lost trying to find the right data to use, and the most common approach is still to ask a coworker. Gaining trust in data requires running a series of queries (max timestamp, counts per day, count distincts, etc.) that waste time and add unnecessary load on the databases. There’s no clear way to find the people who can answer questions about a table. And, worst of all, analysis is often redone and models are rebuilt because previous work is not discoverable.
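
The trust checks named above (max timestamp, counts per day, count distincts) can be sketched as a small profiling step whose result could be computed once and shown next to a table, instead of being rerun by every analyst. This is only an illustration: the `rides` table, its columns, the sample rows, and the `profile_table` helper are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical sample table; a real warehouse would sit behind Hive or Presto.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, city TEXT, requested_at TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (1, "London", "2019-04-29 08:15:00"),
        (2, "London", "2019-04-29 09:40:00"),
        (3, "Leeds", "2019-04-30 18:05:00"),
    ],
)


def profile_table(conn):
    """Run the three trust checks once and return them as one profile."""
    # Freshness: the most recent timestamp in the table.
    latest = conn.execute("SELECT MAX(requested_at) FROM rides").fetchone()[0]
    # Completeness: number of rows per calendar day.
    per_day = conn.execute(
        "SELECT DATE(requested_at) AS day, COUNT(*) "
        "FROM rides GROUP BY day ORDER BY day"
    ).fetchall()
    # Cardinality: number of distinct values in one column.
    distinct_cities = conn.execute(
        "SELECT COUNT(DISTINCT city) FROM rides"
    ).fetchone()[0]
    return {"latest": latest, "per_day": per_day, "distinct_cities": distinct_cities}


profile = profile_table(conn)
print(profile)
```

Caching a profile like this is one assumed way to avoid the repeated load on the databases that the abstract describes; the source does not say how Amundsen itself computes such statistics.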

Mark Grover discusses what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Lyft has reduced the time spent on data discovery tenfold with its data portal, Amundsen.

Amundsen is built on three key pillars:
An augmented data graph
Amundsen uses a graph database under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that it treats people as a first-class data asset—in other words, there’s a graph node for each person in the organization that connects to other nodes (like tables and dashboards).
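
A minimal sketch of the idea of people as graph nodes, using a tiny in-memory structure in place of a real graph database. The `DataGraph` class, the node names, and the relation labels (`owns`, `queries`, `reads`) are assumptions made for illustration and are not Amundsen’s actual schema.

```python
from collections import defaultdict


class DataGraph:
    """Tiny in-memory stand-in for a graph database (hypothetical API)."""

    def __init__(self):
        self.nodes = {}                # node id -> kind ("person", "table", ...)
        self.edges = defaultdict(set)  # node id -> {(relation, other node id)}

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, relation, dst):
        # Store the edge in both directions so lookups can start at either end.
        self.edges[src].add((relation, dst))
        self.edges[dst].add(("rev:" + relation, src))

    def neighbors(self, node_id, kind=None):
        """Return connected node ids, optionally filtered by node kind."""
        return sorted(
            other
            for _, other in self.edges[node_id]
            if kind is None or self.nodes[other] == kind
        )


graph = DataGraph()
graph.add_node("alice", "person")
graph.add_node("bob", "person")
graph.add_node("rides", "table")
graph.add_node("weekly_rides", "dashboard")
graph.add_edge("alice", "owns", "rides")
graph.add_edge("bob", "queries", "rides")
graph.add_edge("weekly_rides", "reads", "rides")

# Who can answer questions about the "rides" table?
people_for_rides = graph.neighbors("rides", kind="person")
print(people_for_rides)
```

Because people are ordinary nodes, the question of who to ask about a table becomes a one-hop lookup from the table node.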

An intuitive user experience
Amundsen runs PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
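
The abstract does not say how the graph for PageRank is built from access logs, so the following is a sketch under one assumption: each user links to every table they queried, once per access. The `pagerank` function is a plain power-iteration implementation, and the log entries and table names are invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of out-links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical access log: (user, table) pairs, one entry per query.
access_log = [
    ("alice", "rides"), ("alice", "rides"), ("alice", "drivers"),
    ("bob", "rides"), ("carol", "rides"), ("carol", "payments_tmp"),
]

links = {}
for user, table in access_log:
    links.setdefault(user, []).append(table)

scores = pagerank(links)
tables = {table for _, table in access_log}
ranked_tables = sorted(tables, key=lambda t: scores[t], reverse=True)
print(ranked_tables)
```

Repeated accesses appear as repeated out-links, so a table that a user queries more often receives a larger share of that user’s rank. A search backend could then use the resulting scores to order matching tables.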

Centralized metadata
Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Deciding on the right place to store all this metadata is a work in progress. Mark shares ongoing efforts in this space, including RISELab’s Ground and WeWork’s Marquez projects.
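
As a rough sketch of what centralizing could mean, the snippet below merges per-source records about the same table into one catalog entry. The `TableMetadata` fields, the record layout, and the sample values are hypothetical; real extractors for Hive, Presto, or Airflow would return richer data, and this is not Amundsen’s actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableMetadata:
    """One central record per table (hypothetical fields)."""
    name: str
    columns: List[str] = field(default_factory=list)
    last_queried: Optional[str] = None
    produced_by: Optional[str] = None
    sources: List[str] = field(default_factory=list)


def merge_metadata(records):
    """Fold (source, record) pairs into one TableMetadata per table."""
    catalog = {}
    for source, record in records:
        name = record["table"]
        entry = catalog.setdefault(name, TableMetadata(name=name))
        entry.sources.append(source)
        # Copy over only the fields this source actually knows about.
        for key in ("columns", "last_queried", "produced_by"):
            if key in record:
                setattr(entry, key, record[key])
    return catalog


# Hypothetical output of three separate extractors.
records = [
    ("hive", {"table": "rides", "columns": ["ride_id", "city", "requested_at"]}),
    ("presto", {"table": "rides", "last_queried": "2019-04-30"}),
    ("airflow", {"table": "rides", "produced_by": "rides_daily_dag"}),
]

catalog = merge_metadata(records)
print(catalog["rides"])
```

Recording which sources contributed to an entry keeps the merged record traceable back to the systems it came from.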

Mark gives a demo of Amundsen and its goals, leads a deep dive into Amundsen’s architecture, and explains how it achieves the three design pillars. Mark closes with a future roadmap of the project, what problems remain unsolved, and how we can work together to solve them.


Mark Grover


Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Comments on this page are now closed.


Mark Grover
1/05/2019 17:57 BST

Hi all,
Thanks for attending. Slides are at

And, look forward to your feedback.

And, this is the main open source repo:

Would love your feedback, thanks in advance!