Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Analytics at Wikipedia

Andrew Otto (Wikimedia Foundation), Fangjin Yang (Imply)
1:15pm–1:55pm Thursday, September 28, 2017
Big data and the Cloud, Data Engineering & Architecture
Location: 1A 21/22
Level: Intermediate
Secondary topics: Data for good, Media, Platform

Who is this presentation for?

  • Architects and developers

Prerequisite knowledge

  • A basic understanding of distributed systems

What you'll learn

  • How the Wikimedia Foundation uses Druid for analytics

Description

The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent company of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to efficiently process pageviews, which peak at about 200,000 requests per second.

Many folks may not realize the WMF has a dedicated data analytics team that is responsible for building out the foundation’s logging and data mining infrastructure—and for making Wikimedia-related statistics available to the other teams at the foundation and, perhaps more importantly, to the world at large. Analytics tracking in the Wikimedia movement started with measuring article and editor counts and has grown to support various metrics, summary formats, and visualizations. As the analytics capabilities grow more sophisticated, they play an increasingly important role in helping guide decisions.

One of the technologies the foundation leverages for its analytics is the Druid open source project, a column-oriented distributed database. Andrew and Fangjin cover Druid’s architecture and use cases and explain how it has complemented workflows at WMF.

Andrew Otto

Wikimedia Foundation

Andrew Otto is a systems engineer at the Wikimedia Foundation, where he supports the analytics team by architecting and maintaining small and big data analytics infrastructure. Previously, Andrew was the lead systems administrator at CouchSurfing.org. He is based in Brooklyn, NY, and spends too much time playing hardcourt bike polo.

Fangjin Yang

Imply

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin has a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.