Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Protecting sensitive data in huge datasets: Cloud tools you can use

Felipe Hoffa (Google)
11:15–11:55 Wednesday, 1 May 2019
Average rating: 3.50 (4 ratings)

Who is this presentation for?

  • Data scientists and data engineers



Prerequisite knowledge

  • Familiarity with SQL

What you'll learn

  • Learn how to identify PII in massive datasets
  • Explore k-anonymity, l-diversity, and related research and options such as removing, masking, and coarsening
  • Gain experience with practical demos over massive datasets


Before releasing a public dataset, practitioners need to thread the needle between the data's utility and the protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and brings concepts like k-anonymity and l-diversity to the practical realm. You’ll also explore options such as removing, masking, and coarsening.
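As a rough illustration of the concepts named in this abstract (it is not taken from the talk, and it does not use any particular cloud tool), the sketch below measures k-anonymity and l-diversity on a tiny in-memory table. The records, the column names, and the helpers `k_anonymity`, `l_diversity`, and `generalize` are all hypothetical, chosen only for this example.

```python
from collections import defaultdict

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row[q] for q in quasi_identifiers)] += 1
    return min(counts.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(v) for v in values.values())

def generalize(row):
    """Remove the name, coarsen age to a decade band, mask the postcode tail."""
    decade = row["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",            # coarsening
        "postcode": row["postcode"][:-2] + "**",    # masking
        "diagnosis": row["diagnosis"],              # "name" is dropped (removing)
    }

# Hypothetical toy records, invented for this example.
records = [
    {"name": "A", "age": 34, "postcode": "12345", "diagnosis": "flu"},
    {"name": "B", "age": 37, "postcode": "12349", "diagnosis": "asthma"},
    {"name": "C", "age": 52, "postcode": "67801", "diagnosis": "flu"},
    {"name": "D", "age": 58, "postcode": "67822", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ["age", "postcode"]

released = [generalize(r) for r in records]

print(k_anonymity(records, QUASI_IDENTIFIERS))                # 1: every raw row is unique
print(k_anonymity(released, QUASI_IDENTIFIERS))               # 2: each group now has two rows
print(l_diversity(released, QUASI_IDENTIFIERS, "diagnosis"))  # 1: one group shares a single diagnosis
```

In this toy table, generalizing raises k from 1 to 2, yet l stays at 1 because both people in the 50-59 group share the same diagnosis. That gap is why l-diversity is usually checked alongside k-anonymity.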

Related research: “Considerations for sensitive data within machine learning datasets”

Felipe Hoffa


Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in videos and blog posts and at conferences around the world.