Crunching Common Crawl with the Cloud-Based MIA Platform

Lisa Green (Common Crawl), Peter Adolphs (Neofonie)
Government/Open Data
Location: 212

The Web is a treasure trove filled with all kinds of information people care about. However, until recently this dataset was reserved for the “big players”, because only they had the infrastructure and expertise to acquire and work on a dataset of this scale. This changed with the advent of cloud computing, which allows virtually anyone to access and process datasets at Web scale on a pay-as-you-go basis. Furthermore, recent advances in Big Data and Natural Language Processing technologies enable us to unlock raw Web data and gain insights from it.

In this joint presentation we will introduce Common Crawl, an open repository of Web data, and MIA, a cloud-based platform for processing and analyzing Web data.

Common Crawl provides over a petabyte of open data that anyone can access for free. We will detail the types of data in the Common Crawl corpus and explain how you can access the data directly.

MIA is a cloud-based platform for processing and analyzing Web data. End users can describe their analytical task in a structured query language based on SQL. We will describe how the platform gathers relevant text data, extracts information, joins, aggregates, and groups it, and returns the results as database tables. We will provide examples of analyses done on the MIA platform and discuss practical applications of Web data analysis that you can incorporate into your research or products.
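To make the join, aggregate, and group steps concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not MIA's query language or API; the table names (`pages`, `mentions`), the helper `top_mentions`, and the sample rows are all hypothetical and serve only to illustrate the kind of SQL-style analysis described above.

```python
import sqlite3


def top_mentions(pages, mentions, limit=3):
    """Join pages with extracted mentions and count mentions per entity and domain.

    Hypothetical schema, used for illustration only:
      pages(page_id, domain)     one row per crawled page
      mentions(page_id, entity)  one row per entity found in a page's text
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (page_id INTEGER PRIMARY KEY, domain TEXT)")
    conn.execute("CREATE TABLE mentions (page_id INTEGER, entity TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?)", pages)
    conn.executemany("INSERT INTO mentions VALUES (?, ?)", mentions)

    # Join the two tables, group by entity and domain, and aggregate with COUNT.
    rows = conn.execute(
        """
        SELECT m.entity, p.domain, COUNT(*) AS n
        FROM mentions AS m
        JOIN pages AS p ON p.page_id = m.page_id
        GROUP BY m.entity, p.domain
        ORDER BY n DESC, m.entity, p.domain
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Made-up sample data, not taken from Common Crawl.
    pages = [(1, "example.org"), (2, "example.org"), (3, "example.com")]
    mentions = [(1, "Berlin"), (2, "Berlin"), (2, "San Francisco"), (3, "Berlin")]
    for row in top_mentions(pages, mentions):
        print(row)
```

The result comes back as ordinary table rows, which mirrors the idea of returning analysis results as database tables.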


Lisa Green

Common Crawl

Lisa Green is the Director at the Common Crawl Foundation where she
oversees the foundation’s mission of building, maintaining and openly
disseminating a comprehensive crawl of the web. Prior to joining
Common Crawl, she was the Chief of Staff at Creative Commons. Lisa
holds a PhD in physical chemistry from the University of California,
Berkeley, lives in San Francisco and is passionate about open systems.


Peter Adolphs


Peter Adolphs is an R&D Project Manager at Neofonie, a Berlin-based full-service company providing text and data mining software and services. Peter studied computer science and linguistics at the Humboldt University in Berlin and worked for 6 years at the German Research Centre for Artificial Intelligence (DFKI). His work focuses on developing and improving information access and knowledge management solutions by applying linguistic analyses at various levels to large amounts of data, with the goal of arriving at application-specific semantics with minimal human effort.