Build Your Own Web Archive: archive.org's Open Source Tools to Crawl, Access & Search Web Captures

Web Applications
Location: E145
Level: Intermediate

The Internet Archive, with support from other libraries around the world, has helped develop a collection of open source tools in Java to support web archiving. These include the Heritrix archival web crawler, “Wayback” for replaying historic web content, and extensions to Nutch for full-text search of web archives. This session will explain the design and capabilities of these tools and briefly demonstrate their use in creating a small personal web archive.

Heritrix has been designed for faithful and complete content archiving but has also found use in other web search contexts. Wayback allows URL-based lookup and follow-up browsing of archived web content. Nutch, as applied to archival web crawls, allows Google-style full-text search of web content, including the same content as it changes over time. Together, they provide everything necessary to archive and access accurate historical records of web-published content.
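To make the idea of URL-based lookup over content that changes with time more concrete, here is a minimal sketch in Python. It is not how Wayback itself is implemented (the tools described here are written in Java); the `CaptureIndex` class, its method names, and the in-memory storage are all hypothetical, and the 14-digit `YYYYMMDDhhmmss` timestamp strings are an assumption of this sketch. It only illustrates one plausible lookup rule: given a URL and a point in time, return the most recent capture made at or before that time.

```python
from bisect import bisect_right
from collections import defaultdict


class CaptureIndex:
    """Toy in-memory index of web captures (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps each URL to a list of (timestamp, content) pairs, kept sorted by timestamp.
        self._captures = defaultdict(list)

    def add(self, url, timestamp, content):
        """Record one capture. Timestamps are assumed to be YYYYMMDDhhmmss strings,
        so plain string ordering matches chronological ordering."""
        entries = self._captures[url]
        entries.append((timestamp, content))
        entries.sort(key=lambda entry: entry[0])

    def lookup(self, url, timestamp):
        """Return the latest (timestamp, content) captured at or before `timestamp`,
        or None if the URL has no capture that early."""
        entries = self._captures.get(url, [])
        times = [t for t, _ in entries]
        pos = bisect_right(times, timestamp)
        return entries[pos - 1] if pos else None


index = CaptureIndex()
index.add("http://example.com/", "20070101000000", "old page")
index.add("http://example.com/", "20080601000000", "new page")

# Asking for the page as of the end of 2007 selects the January 2007 capture.
print(index.lookup("http://example.com/", "20071231235959"))
```

A real archive would keep this index on disk and would probably offer other policies too, such as returning the capture closest to the requested time in either direction; the sketch picks a single rule to keep the example short.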

Gordon Mohr

Internet Archive, Web Group

Gordon Mohr leads software development for the Internet Archive’s public and open source web archiving projects, including the Heritrix web crawler, the Nutch-based archive text search engine, and the Wayback Machine archive browser.

Before joining the Internet Archive, Gordon helped create other innovative applications for the Internet, including Bitzi Bitpedia, a collaborative digital media encyclopedia; Activerse Ding, an instant-messaging platform; and ParcPlace VisualWave, an early web application server and development environment.

Gordon has a BA from the University of California, Berkeley, with a double major in Computer Science and Economics.

OSCON 2008