Data is the new oil. To extract as much intelligence as you can from ever-growing volumes of data, you have to give your data scientists unfettered access to it, but you also have to preserve the privacy of the data that your users have entrusted you with.
LinkedIn houses the most valuable professional data in the world. Protecting the privacy of member data has always been paramount. Shirshanka Das and Tushar Shanbhag discuss the path LinkedIn has taken to protect member privacy on its scalable distributed data ecosystem built around Kafka, Hadoop, and other OSS technologies, specifically diving into the systems and processes LinkedIn created to address the Irish Data Protection Commission. Like most companies, in the early days, its first priority was getting data flowing freely and reliably. Over the past few years, the company has made significant advances in data governance, going above and beyond the commitments it has made to members in how it handles their data.
Shirshanka and Tushar outline three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement framework, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future, when the General Data Protection Regulation goes into effect in 2018, and outline LinkedIn’s plans to address those requirements as well as the challenges that lie ahead. But technology is just part of the solution. You’ll also hear about the cultural and process change at LinkedIn and lessons learned about sustainable process and governance.
Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.
Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen-Horowitz company building data intelligence products using analytics and AI.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.