Building a Web Crawl Platform: 10 Lessons Learned

Rich Skrenta (IBM Watson Group)

In this panel, Rich Skrenta, cofounder and CEO of Blekko, will discuss the challenges technologists face in building a scalable platform that can crawl today’s Web – now an infinite number of pages littered with spam and worse. Skrenta makes the case that, despite advances in algorithmic search crawls, there is still a desperate need for human intervention and intelligence if we want to make search better, given the continued explosion of content on the Web.

Attendees will discover startling insights into today’s Web landscape and the increasing challenges machines face in classifying and organizing all of the information on the Web. Blekko has searched billions of pages to date. In this Jacques Cousteau-style discussion, come discover everything you’ve wanted to know about the sea of the Web.

Issues involved with building a search engine from scratch include:

- Building a distributed system to run 700 servers as a cluster datastore
- Managing reliability and access latency across 5000 drives
- Crawling 3B URLs on the web
- Ranking: what worked, what didn’t
- Surviving launch without being branded FAIL
- Crowdsourcing index refinement tools – many eyes can make web spam a shallow problem

Rich Skrenta

IBM Watson Group

Rich Skrenta, CEO of Blekko, is a seasoned technology executive with nearly two decades of industry experience. Most recently he was founder and CEO of Topix, the leading online news community. Prior to Topix, Rich headed up engineering for a variety of products within AOL, including AOL Shopping, AOL Music and Netscape Search. Prior to AOL, Rich was the co-founder and CEO of NewHoo, which was acquired by Netscape and renamed the Open Directory Project. The ODP is the world’s largest human-edited directory of the web and is used by Google, Yahoo, AOL and many other companies. Rich was an engineering manager at Sun Microsystems prior to NewHoo and has also held development roles at Unix Systems Labs and the Amiga UNIX Group at Commodore Business Machines. Rich holds a patent in network security and has authored many well-known pioneering software efforts, including some early multi-user online games. Rich graduated from Northwestern University with a degree in Computer Science.

Ally Parker

Kaitlin Pike
(415) 947-6306

View a complete list of Web 2.0 Expo contacts.