The Web As The Greatest Dataset Of All Time

Using the web to view pages one at a time is like using a personal computer to only store recipes. Just as most people immensely underestimated the full potential of personal computers 30 years ago, most people have yet to recognize the full potential of web data today.

The web is the largest collection of information in human history and is growing at a staggering rate: the estimated number of webpages grew from 26 million in 1998 to 1 billion in 2000 and passed the 1 trillion mark in 2008. Modern big data tools and falling compute costs have made it possible to gain extremely valuable insight from large-scale analysis of web data, but until recently few people had access to the data. Now tools like Grep the Web, indexing services that provide raw access to web data, and repositories of open data make it possible for almost anyone to extract knowledge from the web that was previously available only to large search engine companies.

This presentation will explain the various tools available and share examples of powerful insights gained from analysis of web data, including results from recent research projects. You will hear firsthand from Blekko CTO Greg Lindahl about the company's motivation for building the free Grep the Web service, how they built it, and what they have learned from it. Kevin Burton, Spinn3r CEO, co-inventor of RSS, and longtime Apache contributor, will share his experiences from a decade of analyzing web content. Lisa Green brings her insight into the relationship between data accessibility and innovation. The three speakers will discuss practical applications of web data analysis that you can incorporate into your research or products.

Lisa Green

Common Crawl

Lisa Green is the Director of the Common Crawl Foundation, where she
oversees the foundation’s mission of building, maintaining, and openly
disseminating a comprehensive crawl of the web. Prior to joining
Common Crawl, she was the Chief of Staff at Creative Commons. Lisa
holds a PhD in physical chemistry from the University of California,
Berkeley, lives in San Francisco, and is passionate about open systems.

Greg Lindahl


Greg Lindahl is the Founder and CTO of Blekko. Before founding Blekko, he founded PathScale, where he architected the software and hardware and saw the company through to a highly successful exit. He is the author of several patents.

Kevin Burton


Founder/CEO of Spinn3r, co-inventor of RSS, Apache contributor, and big data geek.

