Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

Privacy by design, not an afterthought: Best practices at LinkedIn

Shirshanka Das (LinkedIn), Tushar Shanbhag (LinkedIn)
4:15pm–4:55pm Wednesday, December 6, 2017

Who is this presentation for?

  • Legal and privacy experts and big data practitioners

Prerequisite knowledge

  • A basic understanding of privacy and compliance regulations and the capabilities of the Hadoop and Kafka ecosystems with regard to encryption, access control, and compliance

What you'll learn

  • Explore how LinkedIn protects member privacy on its scalable distributed data ecosystem built around Kafka, Hadoop, and other OSS technologies


Data is the new oil. To extract as much intelligence as you can from ever-growing volumes of data, you have to give your data scientists unfettered access to it, but you also have to preserve the privacy of the data that your users have entrusted to you.

LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag discuss the path LinkedIn has taken to protect member privacy on its scalable distributed data ecosystem built around Kafka, Hadoop, and other OSS technologies, specifically diving into the systems and processes LinkedIn created to address the Irish Data Protection Commission. Like most companies, LinkedIn made getting data flowing freely and reliably its first priority in the early days. Over the past few years, the company has made significant advances in data governance, going above and beyond the commitments it has made to members in how it handles their data.

Shirshanka and Tushar outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look ahead to 2018, when the General Data Protection Regulation goes into effect, and outline LinkedIn’s plans to address those requirements as well as the challenges that lie ahead. But technology is just part of the solution. You’ll also hear about the cultural and process changes at LinkedIn and lessons learned about sustainable process and governance.

Shirshanka Das


Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Tushar Shanbhag


Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen-Horowitz company building data intelligence products using analytics and AI.

Comments on this page are now closed.


Shirshanka Das
12/13/2017 2:21pm +08

We’ve shared it with the conference organizers, so it should be posted here shortly.

Emil Laurence Pastor | DATA ENGINEER
12/12/2017 3:42pm +08

Is it possible to get a copy of the slide deck?