Dirty Politics, Dirty Data: Taming the Federal Election Commission’s Database

Real World
Location: Mission City B1

For the first time, Forbes this year included information on political contributions with its Forbes 400 list of the richest people in America. Gathering this data was not easy; the Federal Election Commission publishes contribution records as they’re submitted by campaigns, including typos, misspellings and attempts by campaigns and donors to obscure contributions.

Forbes overcame this difficulty by developing a data-cleaning wizard for the FEC’s data that fit easily into our researchers’ workflow. We downloaded all 6 million post-2006 contribution records in the FEC database and used our wizard to help researchers find and select the identities under which billionaires donate to political organizations. After these identities are selected, the wizard automates the importing of corresponding donation records to the massive MySQL database that contains all of Forbes’s data on billionaires.
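The import step described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the wizard itself: Python's built-in sqlite3 module stands in for the MySQL database, and the table layout, the record fields, and the `import_selected` function are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical schema; sqlite3 stands in for the MySQL database in this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS donations (
    record_id  TEXT PRIMARY KEY,  -- assumed unique id for each contribution record
    person_id  INTEGER NOT NULL,  -- id of the person in the research database
    donor_name TEXT,
    committee  TEXT,
    amount     REAL
)
"""

def import_selected(conn, records, selected):
    """Import records whose identity a researcher has selected.

    `records` is an iterable of dicts (assumed field names), and `selected`
    maps a (name, city, state) identity to a person_id. Returns the number
    of newly inserted rows; records already imported are skipped.
    """
    conn.execute(SCHEMA)
    before = conn.total_changes
    for rec in records:
        identity = (rec["name"], rec["city"], rec["state"])
        person_id = selected.get(identity)
        if person_id is None:
            continue  # not an identity the researcher picked
        # The primary key on record_id makes repeated imports harmless.
        conn.execute(
            "INSERT OR IGNORE INTO donations VALUES (?, ?, ?, ?, ?)",
            (rec["record_id"], person_id, rec["name"],
             rec["committee"], rec["amount"]),
        )
    conn.commit()
    return conn.total_changes - before
```

Keying each row on a record identifier means the same function can be re-run whenever a fresh download arrives, and only records not seen before are added.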

The resulting database contains more than 20,000 contributions by 400 billionaires to 1,500 political committees. This data is largely self-updating (we import more donation records automatically every time the FEC updates its database) and will be used through the year to produce articles that examine the influence of money on politics.

The problem that we addressed is one faced by many data users: an important field (in this case, donor name) was not tied to any kind of individual ID. Instead, we had to construct identities by finding the combinations of fields that correspond to each individual. Our method, which combines some hand identification with automated database operations, yields extremely clean data (compared, in particular, with results compiled by other sites that process FEC data) with only moderate human effort.
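One way to picture the hand-identification step is to list every distinct combination of fields that appears under a given surname, so that a researcher can choose which combinations belong to the person in question. The sketch below is illustrative only: the field names (`name`, `city`, `state`, `employer`), the "LAST, FIRST" name layout, and the `normalize` and `candidate_identities` helpers are assumptions, not the wizard's actual design.

```python
import re
from collections import Counter

def normalize(text):
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^A-Z0-9 ]", " ", text.upper())
    return " ".join(text.split())

def candidate_identities(records, surname):
    """Count distinct (name, city, state, employer) combinations for a surname.

    Names are assumed to be written "LAST, FIRST", so after normalization a
    matching name starts with the surname. Returns (combination, count)
    pairs, most frequent first, for a researcher to review by hand.
    """
    target = normalize(surname)
    counts = Counter()
    for rec in records:
        name = normalize(rec["name"])
        if not (name == target or name.startswith(target + " ")):
            continue  # different surname, e.g. DOERR when searching for DOE
        key = (name, normalize(rec["city"]),
               normalize(rec["state"]), normalize(rec["employer"]))
        counts[key] += 1
    return counts.most_common()
```

Normalizing before grouping merges records that differ only in case or punctuation, while real spelling variants remain separate candidates that still need a human decision.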

I’ll discuss the shortcomings in the FEC’s data that led us to create the data-cleaning wizard, demonstrate the functionality of the wizard and describe its mechanism, and execute sample queries of the sort that will lead to close coverage of billionaires’ political activity.


Jon Bruner

O'Reilly Media

Jon Bruner is Deputy Editor for New Products at Forbes, where he develops new editorial concepts for the web site and magazine and writes occasionally about politics, technology, and finance. He earned a B.S. in mathematics and economics at the University of Chicago.

