Data versus metadata: Overcoming the challenges to securing the modern data lake
Who is this presentation for?
- Data engineers, data architects, and developers
The recent evolution from storing data in a data warehouse to using a hybrid infrastructure of on-premises and cloud data lakes has enabled tremendous agility and scale, but it has also created security and privacy risks that current strategies don’t address. Organizations concerned about the quality of their data, protecting their brand and intellectual property, and complying with evolving privacy regulations must understand how the modern infrastructure has broken the relationship between data and metadata and how this affects the quality and security of their data.
Because of the fundamental split between data and metadata in modern infrastructure, enterprise developers have been forced to handle data quality and security in the individual applications they write. This is an extremely complex and laborious process that has achieved neither goal. Nong Li examines the disconnect between data and metadata, the privacy and security risks associated with it, how to avoid the resulting pitfalls, and why companies need to get it right by enforcing security and privacy consistently across all applications.
This means that the enforcement layer, instead of sitting in the data flow between each application and Hadoop, needs to sit in the data plane, where every application reaches it automatically and where it manages access and transformations, so developers no longer need to handle these concerns themselves. To ensure privacy and security, the metadata and data must be managed in sync by the same system, no matter which application accesses the data. The system must be able to enforce “schema on read” and manage access controls and transformations. This is where the industry must focus its attention, and this is what organizations must demand of their vendors.
Current attempts to solve this via solutions that sit between an application and Hadoop have improved developer usability and productivity, and they can ensure data consistency, but they’re wholly inadequate for security. Only a new approach that sits in the data plane and that enforces metadata creation on write, manages user access, and performs data transformations will enable organizations to ensure data quality, protect their brands, secure their intellectual property, and comply with evolving privacy regulations.
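As a rough illustration of the idea, the following Python sketch shows a single layer that holds the metadata (schema and per-column policies) alongside the data and applies schema on read, access control, and transformations for every caller. All names here (`DataPlane`, `Dataset`, `ColumnPolicy`, `mask_email`, the roles) are hypothetical and invented for this example; it is a toy model under those assumptions, not a description of any vendor's product or of the approach presented in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


def mask_email(value: str) -> str:
    """Hypothetical transformation: hide the local part, keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


@dataclass
class ColumnPolicy:
    # Roles allowed to see this column at all.
    allowed_roles: Set[str]
    # Optional per-role transformation applied before the value is returned.
    transforms: Dict[str, Callable[[str], str]] = field(default_factory=dict)


@dataclass
class Dataset:
    schema: List[str]                  # metadata: ordered column names
    policies: Dict[str, ColumnPolicy]  # metadata: per-column access rules
    raw_lines: List[str]               # data: untyped comma-separated lines


class DataPlane:
    """Toy access layer: every application reads through read()."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}

    def register(self, name: str, dataset: Dataset) -> None:
        # Data and metadata are registered together, so they stay in sync.
        self._datasets[name] = dataset

    def read(self, name: str, role: str) -> List[Dict[str, str]]:
        ds = self._datasets[name]
        rows = []
        for line in ds.raw_lines:
            # Schema on read: structure is applied when the data is accessed.
            record = dict(zip(ds.schema, line.split(",")))
            visible = {}
            for column, value in record.items():
                policy = ds.policies.get(column)
                # Deny by default: no policy or no matching role hides the column.
                if policy is None or role not in policy.allowed_roles:
                    continue
                transform = policy.transforms.get(role)
                visible[column] = transform(value) if transform else value
            rows.append(visible)
        return rows


# Example usage with made-up data and roles.
plane = DataPlane()
plane.register(
    "users",
    Dataset(
        schema=["id", "email"],
        policies={
            "id": ColumnPolicy({"admin", "analyst"}),
            "email": ColumnPolicy({"admin", "analyst"}, {"analyst": mask_email}),
        },
        raw_lines=["1,a@example.com"],
    ),
)
admin_rows = plane.read("users", "admin")
analyst_rows = plane.read("users", "analyst")
guest_rows = plane.read("users", "guest")
```

Because the policy lives with the dataset rather than in each application, an analyst tool and an admin tool calling `read` on the same dataset get different views without either one implementing masking itself.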
Prerequisite knowledge
- General knowledge of data, metadata, security, and privacy regulations
What you'll learn
- Learn why there’s such a disconnect between data and metadata and why it's absolutely critical for companies to put privacy and security first
- Understand the journey from the old data warehouse model to the modern database and the privacy and security issues that arise from this disconnect
- Discover the pitfalls to avoid when securing your modern data lake
Nong Li cofounded Okera in 2016 with Amandeep Khurana and serves as the company’s CEO. Previously, he was on the engineering team at Databricks, where he led performance engineering for Spark core and SparkSQL. Before that, he was tech lead for the Impala project at Cloudera and the author of the Record Service project. Nong is also one of the original authors of the Apache Parquet project and mentors several Apache projects, including Apache Arrow. He has a degree in computer science from Brown University.