Before any analysis can begin, a data scientist needs to discover the right data sources to analyze, understand them, and determine whether they can trust them. Unfortunately, data discovery is very inefficient today. Countless hours are lost trying to find the right data to use. (The most common way still remains to ask a coworker.) Gaining trust in data requires running a bunch of queries (max timestamp, counts per day, count distincts, etc.) that waste time and add unnecessary load on the databases. There’s no clear way to know how to find folks to answer questions about the table. And worst of all, many times analysis is redone and models are rebuilt because previous work isn’t discoverable.
Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Amundsen is built on three key pillars: an augmented data graph, an intuitive user experience, and centralized metadata. Amundsen uses a graph database under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that it treats people as a first-class data asset; in other words, there’s a graph node for each person in the organization that connects to other nodes (like tables, and dashboards). In addition, Amundsen runs PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet. Finally, Amundsen gathers metadata from various different sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. The right place to store all this metadata is a work in progress.
Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, page rank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model.
Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.
Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.
Comments on this page are now closed.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org