Sep 23–26, 2019

The Evolution of Metadata: LinkedIn’s story

Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
2:05pm–2:45pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage, Media and Advertising

Who is this presentation for?

Data engineers, data scientists, AI engineers, and decision makers who have some idea of metadata’s importance in the big data world and are trying to implement a sustainable, full-scope metadata strategy in their companies.

Level

Intermediate

Description

  • What is metadata?
  • What sorts of data constructs does it apply to?
  • When should you collect it?
  • Where and how should you store it?
  • What can you do with it?
  • How do you scale it to a million data constructs, thousands of people, and hundreds of teams?

These fundamental questions are at the heart of LinkedIn’s metadata evolution, a journey that started with a small team trying to improve the searchability of Hadoop data. Over the years, this system has grown into the central data hub where all of LinkedIn’s data assets (more than a million, spanning online, streaming, and batch) have a home. The system is deployed at global scale and powers data productivity for all engineers and data enthusiasts, while serving as critical infrastructure for data privacy by default in our data systems.

In this talk, we focus on different strategies for modeling metadata, storing it, and then scaling its acquisition and refinement across thousands of metadata authors and producing systems. We discuss the pros and cons of each strategy and the scenarios in which we think organizations should deploy them. Strategies discussed include generic types versus specific types, crawling versus publish-subscribe, single source of truth versus multiple federated sources of truth, automated classification of data, lineage propagation, and more!

We also discuss the different axes of scale on which we’ve been tested: the sheer number of entities, the richness of metadata, the connectivity between entities, the velocity of evolution of the metadata model, and the efficiency of serving metadata for simple and complex queries.

We present a metadata system we’ve innovated on over the years that allows for rich, extensible types, supports different types of data entities, provides efficient storage and retrieval of metadata in both site-serving and graph-analytic use cases, and scales well to support distributed development models. We also discuss the relationship of this metadata system to other well-known systems such as the Hive metastore, the Kafka schema registry, Apache Atlas, and Cloudera Navigator.

While the storage abstractions and metadata models are key to a scalable system, without an intuitive interface and UX for this metadata, the understandability of the overall ecosystem is severely limited. We discuss the design challenges we have faced in making metadata insightful for data producers and consumers, and which strategies have worked.

Prerequisite knowledge

Familiarity with the concept of metadata. Familiarity with Hadoop, Kafka, Spark, and other big data technologies. Familiarity with metadata systems such as Hive Metastore, Apache Atlas, and Navigator.

What you'll learn

1. The Importance of Metadata: You must have a cohesive metadata strategy to take advantage of all the good work your data science and AI teams are doing and enable them to be more productive. In a privacy-by-default world, a holistic metadata strategy that is front and center for your organization is important.
2. Strategies for Metadata implementation: How to think about different metadata strategies and which one is the right one for your organization? What are the different systems in this space and which one fits your requirements the best? How can you apply LinkedIn’s learnings to your environment?

Shirshanka Das

LinkedIn

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine, Gobblin, a data lifecycle management platform for Hadoop, WhereHows, a data discovery and lineage platform, and Dali, a data virtualization layer for Hadoop.

Mars Lan

LinkedIn

Mars is currently the technical lead of the metadata team at LinkedIn and has led the design and implementation of LinkedIn’s metadata infrastructure for the past two years. Prior to that, he was a software engineer at Google working on the Google Assistant and Google Cloud products. Mars received his PhD in computer science from UCLA.
