Sep 23–26, 2019

The evolution of metadata: LinkedIn’s story

Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
2:05pm–2:45pm Wednesday, September 25, 2019
Location: 1A 23/24
Average rating: 4.50 (10 ratings)

Who is this presentation for?

  • Data engineers, data scientists, AI engineers, and decision makers who have some idea of metadata’s importance in the big data world and are trying to implement a sustainable, full-scope metadata strategy in their companies

Level

Intermediate

Description

LinkedIn began with a series of fundamental questions at the heart of its metadata evolution—what metadata is, what data constructs it applies to, when it should be collected, when and how it should be stored, what you can do with it, and how you can scale it to a million data constructs, thousands of people, and hundreds of teams.

The journey started with a small team trying to improve the searchability of Hadoop data. Over the years, this system has grown into the central data hub where all of LinkedIn’s data assets (more than a million across online, streaming, and batch systems) have a home. The system is deployed at global scale and powers data productivity for all engineers and data enthusiasts while serving as critical infrastructure for data privacy by default in LinkedIn’s data systems.

Shirshanka Das and Mars Lan examine different strategies for modeling metadata, storing metadata, and then scaling the acquisition and refinement of metadata for thousands of metadata authors and producing systems. They dive into the pros and cons of each strategy and the scenarios in which they think organizations should deploy it. The strategies they explore include generic types versus specific types, crawling versus publish/subscribe, single source of truth versus multiple federated sources of truth, automated classification of data, lineage propagation, and more.
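
To make the crawling versus publish/subscribe trade-off concrete, here is a minimal sketch in Python. It is not LinkedIn's implementation: the class and function names (MetadataStore, SourceSystem, crawl, subscribe) are hypothetical and exist only for illustration. It assumes a pull model in which the catalog scans a source on a schedule, and a push model in which the source notifies the catalog whenever metadata changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetadataStore:
    """Toy in-memory catalog keyed by dataset name (hypothetical)."""
    records: Dict[str, dict] = field(default_factory=dict)

    def upsert(self, name: str, metadata: dict) -> None:
        self.records[name] = metadata


@dataclass
class SourceSystem:
    """Stand-in for a metadata-producing system (hypothetical)."""
    datasets: Dict[str, dict] = field(default_factory=dict)
    subscribers: List[Callable[[str, dict], None]] = field(default_factory=list)

    def change(self, name: str, metadata: dict) -> None:
        # Record the change locally, then push it to every subscriber.
        self.datasets[name] = metadata
        for notify in self.subscribers:
            notify(name, metadata)


def crawl(source: SourceSystem, store: MetadataStore) -> None:
    """Pull model: the catalog scans the source and copies what it finds."""
    for name, metadata in source.datasets.items():
        store.upsert(name, dict(metadata))


def subscribe(source: SourceSystem, store: MetadataStore) -> None:
    """Push model: the source notifies the catalog on every change."""
    source.subscribers.append(lambda name, md: store.upsert(name, dict(md)))
```

In this sketch, a store wired up with subscribe sees a change as soon as the source calls change, while a crawled store stays stale until the next crawl run. The crawler, on the other hand, needs no cooperation from the source system.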

They also outline the different axes along which these strategies have been tested at scale: the sheer number of entities, the richness of metadata, the connectivity between entities, the velocity of evolution of the metadata model, and the efficiency of serving metadata for simple and complex queries. You’ll see the metadata system LinkedIn has innovated on over the years, which allows for rich extensible types, supports different types of data entities, provides efficient storage and retrieval of metadata in both site-serving and graph-analytic use cases, and scales well to support distributed development models. They’ll outline the relationship of this metadata system to other well-known systems like the Hive metastore, the Kafka schema registry, Apache Atlas, and Cloudera Navigator.

While the storage abstractions and metadata models are key to a scalable system, without an intuitive interface and UX for this metadata, the understandability of the overall ecosystem is severely limited. Shirshanka and Mars detail the design challenges faced in making metadata insightful for data producers and consumers and what strategies have worked.

Prerequisite knowledge

  • Familiarity with the concept of metadata; Hadoop, Kafka, Spark, and other big data technologies; and metadata systems such as Hive metastore, Apache Atlas, and Navigator

What you'll learn

  • Learn the importance of metadata and strategies for metadata implementation

Shirshanka Das

LinkedIn

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Mars Lan

LinkedIn

Mars Lan is a staff software engineer at LinkedIn, where he’s been leading the team to design and implement LinkedIn’s metadata infrastructure for the past two years. Previously, he worked on Google Assistant and Google Cloud products at Google. Mars earned his PhD in computer science from UCLA.

Comments

Shirshanka Das | Principal Staff Software Engineer and Architect
09/30/2019 8:43am EDT

Slides have been shared with the conference organizers, so they should be linked here shortly.

They are also available here: https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019

Kaushik Deka | Director, Novantas
09/30/2019 5:06am EDT

Can you please post the slides?