Natural Language Processing, Advanced Analytics, and Entity Resolution at Massive Scale

Matthew Russell (Digital Reasoning Systems)
Average rating: 2.69 (13 ratings)

We’ve spent the past year helping the intelligence community (IC) build a brand-new system from the ground up. It mines massive amounts of structured and unstructured data (high tens of millions to 100+ million documents), stores the results in a highly flexible, massively distributable semantic meta-model, and lends itself to fierce analysis by the men and women who protect our country. We rely heavily on Java-based open source software to do it.

This presentation begins with an overview of the problem domain and our experiences with performing semantic analysis on massive amounts of messy structured and unstructured data, but quickly transitions to a discussion of how we’ve leveraged open source software to create a powerful solution stack for the IC. Specifically, we’ll discuss:

  • NLP – How we’ve leveraged MALLET (MAchine Learning for LanguagE Toolkit) and other open source AI technology to build a natural language processor and entity extractor that’s competitive with the state of the art
  • NoSQL – How we’ve augmented the traditional Hadoop stack to use Cassandra (a Java-based, eventually consistent storage system) instead of HBase to build a massively distributable, fault-tolerant infrastructure that runs on commodity hardware
  • UI/UX – How we’ve used Flare (an open source visualization toolkit inspired by Prefuse) to build a powerful Flex-based user interface that lends itself to advanced analysis

A working demonstration using publicly available data will accompany the presentation (and will hopefully be available at a public URL for users to tinker with during and after the presentation).

Matthew Russell

Digital Reasoning Systems

Vice President of Engineering – Digital Reasoning Systems, a firm that specializes in data mining at scale

Comments on this page are now closed.


Matthew Russell
07/23/2010 4:19am PDT

Ben – it’s actually not classified. I just need to have it reviewed before I can make it available. Once I’m able to do that, I’ll get it posted. Thanks for the kind words, BTW.

Ben Reece
07/23/2010 4:11am PDT

The presentation was informative, but it was a little disappointing that it’s classified and so unavailable.
