As digital technologies become more pervasive in our daily lives, we are leaving an increasing number of digital traces, including cell phone data, health records, public transportation trajectories, credit card transactions, and connected car communications. Each dataset, although anonymized, tells a story about a different aspect of human life and can be used to leverage business resources in specific areas, such as optimizing mobility systems, predicting economic growth, and forecasting customer purchases. Since the data is anonymized and all identifiable information is removed, the information is limited to each individual dataset, and there is no common field or variable that would allow datasets to be merged. However, the patterns in people’s behavior can offer a basis for fusing data from different sources at the individual level and provide valuable new insights into human behavior, leading to more effective and useful products and services that mutually benefit businesses and customers without compromising the privacy of individuals.
Behrooz Hashemian shares a novel paradigm to combine multiple anonymized datasets through pattern recognition and statistical learning techniques. This data fusion technique is based on a fundamental concept: although people’s identities are fully anonymized, the environment that they are interacting with is not, making it possible to generate new meta-information from anonymized individual trajectories and allow information from multiple sources to complement and enrich each other without compromising people’s privacy. These linked datasets establish a collective knowledge platform that helps to build solutions and make informed decisions.
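The core idea, that the environment is shared even when identities are not, can be illustrated with a toy sketch. This is not the speaker's method: it assumes each dataset has already been reduced to hypothetical `(pseudo_id, place, hour)` records, and the function name `link_by_cooccurrence`, the `min_shared` threshold, and the sample data are all invented for illustration. A real system would rely on statistical learning over much noisier trajectories.

```python
from collections import Counter, defaultdict


def link_by_cooccurrence(records_a, records_b, min_shared=2):
    """Match pseudonymous IDs across two datasets by shared (place, hour) cells.

    records_a, records_b: iterables of (pseudo_id, place, hour) tuples.
    Returns {id_a: (id_b, shared_count)} for the best candidate of each id_a.
    """
    # Index dataset B by the non-anonymized environment: who was seen where, when.
    seen_in_cell = defaultdict(set)
    for id_b, place, hour in records_b:
        seen_in_cell[(place, hour)].add(id_b)

    # For every ID in A, count the distinct cells it shares with each ID in B.
    shared = defaultdict(Counter)
    for id_a, place, hour in set(records_a):
        for id_b in seen_in_cell.get((place, hour), ()):
            shared[id_a][id_b] += 1

    links = {}
    for id_a, counts in shared.items():
        ranked = counts.most_common(2)
        best_id, best_count = ranked[0]
        # Keep a link only if it is well supported and not tied with a runner-up.
        tied = len(ranked) > 1 and ranked[1][1] == best_count
        if best_count >= min_shared and not tied:
            links[id_a] = (best_id, best_count)
    return links


# Hypothetical sample data: transit taps and card purchases, both pseudonymized.
transit = [
    ("t1", "station_A", 8), ("t1", "mall", 12), ("t1", "station_B", 18),
    ("t2", "station_A", 8), ("t2", "station_C", 17),
]
purchases = [
    ("c9", "station_A", 8), ("c9", "mall", 12), ("c9", "station_B", 18),
    ("c7", "station_A", 8), ("c7", "station_C", 17),
]

links = link_by_cooccurrence(transit, purchases)
print(links)
```

In this sketch the join key is the place and time bin rather than any personal identifier, so the two pseudonymous ID spaces are linked without either dataset revealing who the person is. The tie check and `min_shared` threshold are a crude stand-in for the confidence estimate a statistical model would provide.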
Behrooz also addresses the serious privacy concerns that the technique raises. What if one of the datasets contains identifiable information? This may allow re-identification of the other anonymized datasets and cause a privacy breach. This form of de-anonymization not only challenges current anonymization techniques and policies that rely on single-dataset information but also warns of the unpredictable consequences of publishing de-identified data. This issue calls for the development of new security and privacy policies as well as a new privacy-guaranteed way of interacting with data.
Behrooz Hashemian is Vice President of Artificial Intelligence at VideaHealth. Previously, he was lead machine learning scientist at the MGH & BWH Center for Clinical Data Science, where he was responsible for developing and implementing state-of-the-art machine learning models to address various clinical use cases by leveraging medical imaging data, clinical time series data, and electronic health records. He was also chief data officer at the MIT Senseable City Lab, where he focused on innovative implementation of big data analytics and artificial intelligence in smart cities.
©2017, O'Reilly Media, Inc.