Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to read and edit the wiki anonymously, without fear of giving away personal information, editing history, or browsing history. The Wikimedia Foundation (WMF), the nonprofit behind Wikipedia’s software and infrastructure, has strict privacy and data retention policies that were developed with the Wikimedia community at large. Practically all data containing personal identifiers or user activity must be deleted or anonymized within 90 days of its collection.
However, the organization and the community are eager to use big data to better understand the ecosystem and improve it. As of this writing, developer teams are sending more than 2,000 custom events per second to the analytics pipeline, continuously feeding more than 200 datasets. That is in addition to the 10 billion (US) web request logs that are ingested daily into the Hadoop cluster and used to populate several important tools, like WMF’s analytics API. Long-term retention of this data is key to the work of the foundation’s analysts and researchers.
Is it possible to retain value from these datasets when they are controlled by such strict privacy policies? How can you ensure that new datasets follow the policies without bureaucratic bottlenecks? What advantages does sanitizing data have beyond compliance?
Marcel Ruiz Forns covers the challenge of maintaining the value of data while significantly reducing the risk of user identification and privacy loss, and how WMF’s analytics team approaches it. Discover whether your company would benefit from the vegan data diet.
Marcel Ruiz Forns is a software engineer on the analytics team at the Wikimedia Foundation. He believes it’s a privilege to be able to contribute professionally to Wikipedia and the free knowledge movement. He has also worked on quite disparate things, such as recommender systems, serious games, natural language processing, and…selling hand-painted T-shirts on the beach of Natal, Brazil.
©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com