Corpus Bootstrapping with NLTK

Jacob Perkins (Weotta)
Deep Data, A-B

When it comes to natural language processing, general APIs and generic models are often far less accurate than you want. Or maybe the APIs you need don’t even exist. Either way, you can use “corpus bootstrapping” to create custom models and APIs. Corpus bootstrapping is a method of rapidly producing a custom corpus for training highly accurate natural language processing models. For example, suppose you want to do sentiment analysis for Spanish text, but you can only find APIs and models for English. Or you want to do phrase extraction for phrases that are not exactly noun phrases. Maybe you want to classify text but there’s no corpus in existence with the categories you’re interested in. All of these problems can be solved by iterating your way to a custom corpus for training custom models.
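The iteration described above can be sketched as a simple self-training loop. This is only one possible reading of "corpus bootstrapping", not necessarily the method presented in the talk: hand-label a small seed set, train a classifier, let it label new text, keep only the confident guesses, and retrain. The function names `bag_of_words` and `bootstrap_corpus`, the bag-of-words features, and the `rounds` and `threshold` parameters are all assumptions made for illustration.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Turn a text into a simple presence-of-word feature dict."""
    return {word: True for word in text.lower().split()}


def train(corpus):
    """Train a Naive Bayes classifier on (text, label) pairs."""
    return NaiveBayesClassifier.train(
        [(bag_of_words(text), label) for text, label in corpus]
    )


def bootstrap_corpus(seed, unlabeled, rounds=3, threshold=0.9):
    """Grow a labeled corpus from a small hand-labeled seed (hypothetical helper).

    seed: list of (text, label) pairs labeled by hand.
    unlabeled: list of raw texts.
    Returns the grown corpus and a classifier trained on all of it.
    """
    corpus = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        classifier = train(corpus)
        remaining = []
        for text in pool:
            dist = classifier.prob_classify(bag_of_words(text))
            label = dist.max()
            # Keep only the guesses the current model is confident about.
            if dist.prob(label) >= threshold:
                corpus.append((text, label))
            else:
                remaining.append(text)
        if len(remaining) == len(pool):
            break  # nothing new was added, so another round would not help
        pool = remaining
    return corpus, train(corpus)
```

In practice the confident guesses would normally be reviewed by a person before being added, since a self-training loop left unchecked can reinforce its own mistakes.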

This talk will cover:

  • creating a classified corpus from scratch
  • generating a sentiment analysis corpus in Spanish by starting with an English corpus
  • using simplified part-of-speech tags to quickly produce a custom corpus for phrase extraction
  • training custom models with NLTK-Trainer

Code examples will be in Python using NLTK.
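As a rough illustration of the phrase extraction topic listed above, the sketch below runs NLTK's `RegexpParser` over tokens that already carry simplified part-of-speech tags. The tag names (`DET`, `ADJ`, `NOUN`, `VERB`), the `PHRASE` grammar, and the helper `extract_phrases` are assumptions chosen for the example, not material from the talk. The tokens are tagged by hand so the example needs no tagger model.

```python
from nltk.chunk import RegexpParser

# A verb followed by an optional determiner, any adjectives, and one or
# more nouns: a phrase type that is not a plain noun phrase.
GRAMMAR = "PHRASE: {<VERB><DET>?<ADJ>*<NOUN>+}"


def extract_phrases(tagged_tokens, grammar=GRAMMAR):
    """Return the text of every PHRASE chunk found in (word, tag) pairs."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == "PHRASE")
    ]


sentence = [
    ("The", "DET"), ("cafe", "NOUN"), ("serves", "VERB"),
    ("great", "ADJ"), ("coffee", "NOUN"), (".", "."),
]
print(extract_phrases(sentence))
```

Phrases extracted this way could be corrected by hand and saved as chunked training data, which is one way a quick rule-based pass can seed a custom corpus.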

Jacob Perkins


Jacob is the cofounder & CTO of Weotta and the author of Python Text Processing with NLTK 2.0 Cookbook. He blogs at Streamhacker and has created both the NLTK Demos & APIs and NLTK-Trainer.


