As data practitioners, we come to Strata because we are excited by the opportunities to unlock the value in data. But as individuals, we are each sensitive to how our own data is used, and we want our privacy to be respected. We expect organizations to keep our data secure, but we also expect them to use our data ethically and not exploit or leak our private data. Many citizens are simply unaware of the degree to which their trails of data can reveal highly private information. Meanwhile, organizations are not doing enough to preserve privacy; they need to find privacy-preserving ways to analyze and operationalize data.
Organizations may be exposed to far greater liability from customer reidentification than they realize. Jason McFall surveys the risks around private data and discusses examples of privacy breaches in which well-meaning, responsible organizations inadvertently violated privacy because they didn't understand the threats they faced. These threats include linkage attacks, where joining a dataset to a public dataset can reveal private information; network graph matching, where an attacker identifies a segment of a graph (such as a social graph) and then walks it; and the risks of aggregate data, where a single data point seems innocuous in isolation but many points together can reveal highly private information. Real-world examples include mining social network comments, likes, and friend graphs; connecting location data to learn patterns about where a person lives, works, and travels; and exploiting Internet of Things data.
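To make the linkage-attack risk concrete, here is a minimal sketch in the spirit of the classic voter-roll reidentification studies. All names and records are hypothetical; the point is that a "de-identified" dataset can still carry quasi-identifiers (ZIP code, birth date, sex) that join cleanly against a public dataset.

```python
# Hypothetical illustration of a linkage attack: the medical records below
# have names removed, but the quasi-identifier triple (zip, dob, sex) is
# enough to join them to a public voter roll and recover identities.

deidentified_medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-02-14", "sex": "M", "diagnosis": "diabetes"},
]

public_voter_roll = [
    {"name": "Alice Example", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02140", "dob": "1970-01-01", "sex": "M"},
]

def link(records, public):
    """Join the two datasets on the quasi-identifier triple (zip, dob, sex)."""
    index = {(p["zip"], p["dob"], p["sex"]): p["name"] for p in public}
    return [
        {"name": index[key], "diagnosis": r["diagnosis"]}
        for r in records
        if (key := (r["zip"], r["dob"], r["sex"])) in index
    ]

# One "anonymous" record now carries both a name and a diagnosis.
reidentified = link(deidentified_medical, public_voter_roll)
```

Neither dataset is sensitive on its own; the breach only appears once they are connected, which is exactly why releasing data with names stripped is not the same as anonymizing it.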
Jason outlines techniques that enable the safe and effective use of data while preserving privacy, including tokenization and masking; generalization and blurring of data (such as k-anonymity); controlled, privacy-preserving querying of data (such as differential privacy); homomorphic encryption; and randomized response for the IoT. He explores the strengths and weaknesses of each approach before closing with key lessons that individual citizens, organizations, and data scientists need to know about privacy.
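Of the techniques listed above, randomized response is simple enough to sketch in a few lines. This is an illustrative toy, not Privitar's implementation: each respondent flips a fair coin and either answers truthfully or answers at random, so any single reported bit is plausibly deniable, yet the analyst can still recover the population rate by inverting the known noise.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Warner-style randomized response: with probability 1/2 report the
    truth; otherwise report a uniformly random bit. The individual's real
    answer stays deniable."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

def estimate_rate(responses) -> float:
    """Invert the noise at the population level. The observed "yes"
    fraction p satisfies p = 0.5 * true_rate + 0.25, so
    true_rate = 2 * p - 0.5."""
    p = sum(responses) / len(responses)
    return 2 * p - 0.5
```

With enough responses the estimate converges on the true rate even though no single answer can be trusted, which is the same privacy/utility trade-off differential privacy formalizes.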
Jason McFall is the CTO at Privitar, a London startup using machine learning and statistical techniques to open up data for safe secondary use, without violating individual privacy. Jason has a background in applying machine learning to marketing automation and customer analytics. Before that, he was an experimental physicist, working on particle physics collider experiments.
©2016, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.