Mar 15–18, 2020

Amundsen: An open source data discovery and metadata platform

Jin Hyuk Chang (Lyft), Tao Feng (Lyft)
1:45pm–2:25pm Wednesday, March 18, 2020
Location: LL20C

Who is this presentation for?

Data engineers, data architects, developers

Level

Intermediate

Description

Lyft has reduced the time it takes to discover data by 10x by building its own data discovery portal, Amundsen. Since it was open-sourced, Amundsen has been used and extended by many companies in the community.

Amundsen is built on three key pillars: the augmented data graph, intuitive user experience, and centralized metadata. For the augmented data graph, Amundsen uses a graph database to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that it brings all related metadata (usage, last updated, watermark, stats, etc.) into this graph. It also treats people as a primary data asset: there’s a graph node for each person in the organization that connects to other nodes (like tables and dashboards). This helps with problems such as ramping up a new team member, because the graph can answer which tables a teammate uses most frequently. Amundsen strives to provide an intuitive user experience and to deliver relevant data discovery by running PageRank on data from access logs to power search ranking, similar to how Google ranks web pages on the internet. And Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Where best to store all this metadata is still a work in progress.
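To make the graph and ranking ideas above concrete, here is a minimal sketch in plain Python. It is not Amundsen's code: the access-log records, the function names (build_graph, most_used_table, pagerank), and the damping value are all assumptions chosen for illustration. People and tables are both nodes, each edge is weighted by how often a person queried a table, and a simple PageRank pass scores tables so that heavily used ones rank higher.

```python
from collections import Counter, defaultdict

# Hypothetical access-log records: one (user, table) pair per query.
ACCESS_LOG = [
    ("alice", "rides"),
    ("alice", "rides"),
    ("alice", "drivers"),
    ("bob", "rides"),
    ("bob", "payments"),
    ("carol", "rides"),
]


def build_graph(log):
    """Build weighted edges person -> table; weight = number of accesses."""
    edges = defaultdict(Counter)
    for user, table in log:
        edges[user][table] += 1
    return edges


def most_used_table(edges, user):
    """Answer "which table does this person use most often?"."""
    return edges[user].most_common(1)[0][0]


def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (illustrative, not tuned)."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = edges.get(node)
            if targets:
                # Split this node's rank across its edges by access count.
                total = sum(targets.values())
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[node] * weight / total
            else:
                # Nodes with no outgoing edges (tables here) spread evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = build_graph(ACCESS_LOG)
    print("alice's most used table:", most_used_table(graph, "alice"))
    scores = pagerank(graph)
    tables = {t for targets in graph.values() for t in targets}
    for table in sorted(tables, key=scores.get, reverse=True):
        print(f"{table}: {scores[table]:.4f}")
```

In this toy log the rides table is queried by every user, so it ends up with the highest score among tables. A production system would keep the graph in a graph database and feed the scores into a search index rather than compute them in memory.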

Jin Hyuk Chang and Tao Feng explore what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. You’ll see Amundsen’s architecture and how it achieves the three design pillars. More importantly, you’ll discover how Amundsen could be customized and extended for other companies’ data ecosystems. Jin and Tao share a future road map of the project, what problems remain unsolved, and how you can work together to solve them.

Prerequisite knowledge

  • A basic understanding of data science workflows

What you'll learn

  • Learn how to reduce time to data discovery in your own organizations using Amundsen

Jin Hyuk Chang

Lyft

Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, focused on big data and service-oriented architecture.

Tao Feng

Lyft

Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.
