Crawling the Web to Keep It Open

Lisa Green (Common Crawl)
Location: Online

As the largest and most diverse collection of information in human
history, the web grants us tremendous insight if we can only
understand it better. Web crawl data can be used to spot trends and
identify patterns in economics, health, politics, popular culture and
many other aspects of life. It provides an immensely rich corpus for
scientific research, technological advancement, and innovative new
businesses. It is crucial for our information-based society that the
web be openly accessible to anyone who wants to use it.

Common Crawl produces and maintains a repository of web crawl data
that is openly accessible to everyone. The crawl currently covers 5
billion pages and includes valuable metadata. Small startups or even
individuals can now access high-quality crawl data that was previously
only available to large search engine corporations.

In this session, Common Crawl Director Lisa Green will discuss the
value of open crawl data, explain how the Common Crawl corpus can be
accessed, and give examples of how it is currently being used in
research, education and business.

Lisa Green

Common Crawl

Lisa Green is the Director at the Common Crawl Foundation, where she
oversees the foundation’s mission of building, maintaining and openly
disseminating a comprehensive crawl of the web. Prior to joining
Common Crawl, she was the Chief of Staff at Creative Commons. Lisa
holds a PhD in physical chemistry from the University of California,
Berkeley, lives in San Francisco and is passionate about open systems.