Amundsen: An open source data discovery and metadata platform
Who is this presentation for?
- Data engineers, data architects, developers
Lyft has reduced the time it takes to discover data by 10x by building its own data discovery portal, Amundsen. Since being open-sourced, Amundsen has been used and extended by many companies in the community.
Amundsen is built on three key pillars: the augmented data graph, intuitive user experience, and centralized metadata.
For the augmented data graph, Amundsen uses a graph database to store relationships between data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that it brings all related metadata (usage, last updated, watermark, stats, etc.) into this graph. It also treats people as a primary data asset: there’s a graph node for each person in the organization, connected to other nodes such as tables and dashboards. This helps with problems such as ramping up, because the graph can answer which tables a team member uses most frequently.
For user experience, Amundsen strives to be intuitive and to deliver relevant results by running PageRank over data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
For centralized metadata, Amundsen gathers metadata from many sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Deciding on the right place to store all this metadata is still a work in progress.
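To make the graph and ranking ideas concrete, here is a minimal sketch in plain Python. It is not Amundsen's implementation: the log records, the node naming, and the functions `build_graph`, `pagerank`, `rank_tables`, and `frequent_tables` are all hypothetical, and the damping factor of 0.85 is just the conventional PageRank default. It assumes the access log can be reduced to (person, table) pairs, builds a graph with a node for each person and each table, and runs PageRank by power iteration to score tables.

```python
from collections import defaultdict

# Hypothetical access-log records: one (person, table) pair per query.
ACCESS_LOG = [
    ("alice", "rides"),
    ("alice", "rides"),
    ("alice", "drivers"),
    ("bob", "rides"),
    ("carol", "rides"),
    ("carol", "payments"),
]


def build_graph(log):
    """Build a weighted graph with person and table nodes.

    Each access adds weight to the edge in both directions, so every
    node in the graph has at least one outgoing edge.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for person, table in log:
        p, t = ("person", person), ("table", table)
        graph[p][t] += 1.0
        graph[t][p] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (assumes no dangling nodes)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            for neighbor, weight in neighbors.items():
                new_rank[neighbor] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


def rank_tables(log):
    """Return (table, score) pairs, highest PageRank score first."""
    scores = pagerank(build_graph(log))
    tables = [(name, s) for (kind, name), s in scores.items() if kind == "table"]
    return sorted(tables, key=lambda item: item[1], reverse=True)


def frequent_tables(log, person):
    """Return the tables a person accessed, most frequent first."""
    neighbors = build_graph(log)[("person", person)]
    ordered = sorted(neighbors, key=neighbors.get, reverse=True)
    return [name for _, name in ordered]


if __name__ == "__main__":
    for table, score in rank_tables(ACCESS_LOG):
        print(f"{table}: {score:.3f}")
    print(frequent_tables(ACCESS_LOG, "alice"))
```

Because people are nodes in the same graph as tables, the "which tables does this teammate use most" question becomes a neighbor lookup (`frequent_tables`), while the PageRank scores could feed a search ranker. A production system would more likely compute both inside the graph database or a batch job rather than in memory like this.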
Jin Hyuk Chang and Tao Feng explore what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. You’ll see Amundsen’s architecture and how it achieves the three design pillars. More importantly, you’ll discover how Amundsen could be customized and extended for other companies’ data ecosystems. Jin and Tao share a future road map of the project, what problems remain unsolved, and how you can work together to solve them.
- A basic understanding of data science workflows
What you'll learn
- Learn how to reduce time to data discovery in your own organization using Amundsen
Jin Hyuk Chang
Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, focusing on big data and service-oriented architecture.
Tao Feng
Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.