Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop. Like most companies, in the early days, LinkedIn’s first priority was getting data flowing freely and reliably. Over the past few years, the company has made significant advances in data governance, going above and beyond expectations with regard to the commitments it has made to members in how it handles their data.
Shirshanka and Tushar share how LinkedIn handled the Irish Data Protection Commissioner’s requirements for ensuring that member data was purged from all data systems including Hadoop within the required timeframe and the kind of systems the company had to build to solve it. They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data movement platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.
Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Most recently, Tushar was vice president of products and design at Arimo, an Andreessen-Horowitz company building data intelligence products using analytics and AI.
Comments on this page are now closed.
For exhibition and sponsorship opportunities, email strataconf@oreilly.com
For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com
View a complete list of Strata Data Conference contacts
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
Comments
Slides will be linked here shortly.
They’re also up on slideshare at: https://www.slideshare.net/secret/uA6UFZR9NIHA2b
Is there a link to the presentation from this session? It was extremely helpful