Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources

Emerging Topics
Location: Portland 255
Level: Intermediate
Average rating: 4.00 (11 ratings)

Wikipedia contains a wealth of collective knowledge, but its semi-structured design and idiosyncratic markup make mining this resource a formidable challenge. This session will examine techniques for mining semantically weak data sources for explicit facts.

The session will use WEX, a preprocessed normalization of Wikipedia designed to make this corpus easily accessible to developers interested in machine learning, natural language processing, or knowledge extraction. The process through which WEX is prepared will be discussed as a guide to creating mineable structures from semi-structured data, followed by approaches to machine extraction from structures of mixed data quality.

The session is targeted at intermediate developers with an interest in machine learning or knowledge extraction (though no experience is assumed with either).

The demonstrations use the XPath support in Postgres 8.3 to simplify the programming model, with examples in Python, but the data and principles are compatible with any modern data infrastructure.
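To give a flavor of the XPath-style approach, here is a minimal sketch of pulling explicit facts out of a semi-structured article. It is not taken from the session materials: the XML layout, the element and attribute names (article, template, param), the sample values, and the extract_facts helper are all invented for illustration, and it uses the limited XPath subset in Python's standard library rather than Postgres.

```python
import xml.etree.ElementTree as ET

# Hypothetical article XML. The element names, attribute names, and values
# are invented for illustration and are NOT the actual WEX schema.
ARTICLE_XML = """
<article title="Example City">
  <template name="Infobox settlement">
    <param name="population_total">12345</param>
    <param name="established_date">1900</param>
    <param name="nickname"></param>
  </template>
  <paragraph>Example City is a city used for illustration.</paragraph>
</article>
"""


def extract_facts(xml_text):
    """Return (subject, property, value) triples from template parameters."""
    root = ET.fromstring(xml_text)
    subject = root.get("title")
    facts = []
    # ElementTree supports only a subset of XPath, which is enough here.
    for param in root.findall("./template/param"):
        value = (param.text or "").strip()
        if value:  # skip empty parameters, common in mixed-quality data
            facts.append((subject, param.get("name"), value))
    return facts


if __name__ == "__main__":
    for fact in extract_facts(ARTICLE_XML):
        print(fact)
```

The same path expression could in principle be evaluated inside the database instead, which is the simplification the demonstrations refer to; the sketch only shows the shape of the query and the filtering of empty values.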

Jamie Taylor


While developing an Internet laboratory for studying economic equilibria, Jamie started one of the first ISPs in San Francisco so he could get a better connection at home.
He finally got a real job as CTO at DETERMINE Software (now a part of Selectica), helping create order in the unstructured world of enterprise contract management.
He is now helping to organize the world’s structured information at Metaweb where he oversees data operations.

Colin Evans


Colin fights information entropy on a daily basis using a wide arsenal of machine learning and semantic analytic techniques. The results of his efforts appear as millions of assertions in Freebase.
Prior to joining Metaweb, Colin helped users organize their world through his work on the IRIS semantic desktop project at SRI.

Toby Segaran


Toby Segaran is the author of the O’Reilly title “Programming Collective Intelligence,” Amazon’s top-selling AI book, and the Director of Software Development at Genstruct, a biotechnology company. He loves applying data-mining algorithms to everything from pharmaceutical trials to the Technorati Top 100.

OSCON 2008