Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit

Marcel Ruiz Forns (Wikimedia Foundation)
14:05–14:45 Thursday, 2 May 2019
Secondary topics: Security and Privacy

Who is this presentation for?

  • Software developers, data engineers, analysts, researchers, and those interested in privacy



Prerequisite knowledge

  • A basic understanding of software development and data analysis

What you'll learn

  • Explore challenges, gotchas, failures, and successes of the Wikimedia Foundation's analytics team when trying to comply with strict privacy policies while keeping the value of the data


Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify the wiki in anonymity, without fear of giving away personal information, editorship, or browsing history. The Wikimedia Foundation (WMF), the nonprofit behind Wikipedia’s software and infrastructure, has strict privacy and data retention policies that were developed with the Wikimedia community at large. Practically all data containing personal identifiers or user activity must be deleted or anonymized 90 days, at most, after its collection.

However, the organization and the community are eager to use big data to better understand the ecosystem and improve it. As of this writing, developer teams are sending more than 2,000 custom events per second to the analytics pipeline and constantly feeding 200+ datasets. That is in addition to the 10 billion (US) web request logs that are ingested daily into the Hadoop cluster and are used to populate several important tools, like WMF’s analytics API. The long-term existence of this data is key to the foundation’s analysts and researchers.

Is it possible to retain value from these datasets when they are controlled by such strict privacy policies? How can you ensure that new datasets follow the policies without bureaucratic bottlenecks? What advantages does sanitizing data have beyond compliance?

Marcel Ruiz Forns covers the challenge of maintaining the value of data while significantly reducing the risk of user identification and privacy loss, and how WMF’s analytics team approaches it. Discover whether your company would benefit from the vegan data diet.


Marcel Ruiz Forns

Wikimedia Foundation

Marcel Ruiz Forns is a software engineer on the analytics team at the Wikimedia Foundation. He believes it’s a privilege to be able to professionally contribute to Wikipedia and the free knowledge movement. He’s also worked on quite disparate things such as recommender systems, serious games, natural language processing, and…selling hand-painted T-shirts on the beach of Natal, Brazil.