Make Data Work
Oct 15–17, 2014 • New York, NY

PDF Prison Break: Freeing Data, Empowering Experts at

John Akred (Silicon Valley Data Science), Karim Qazi (
2:35pm–3:15pm Friday, 10/17/2014
Business & Industry
Location: 1 E14/1 E15

It’s cliché at this point to observe that much critical information in an enterprise is difficult to access and benefit from due to lack of structure and difficult formats. We will present a story of a successful data prison break, empowered by data science and engineering, that transformed Edmunds’ ability to get new products to market.

Edmunds and Silicon Valley Data Science built a system that dramatically reduces the manual effort required to prepare new models to be discoverable on the site. We will discuss how we implemented this capability using Idibon’s natural language processing, combined with extraction capabilities that feed data trapped in PDFs into that library, and how we integrated entity resolution into the larger process of defining a new vehicle model.

PDFs are often the bane of data science. Great for distributing human-readable content across a diverse technology landscape, they are poorly suited to making data machine-readable. The first challenge in defining a new car model for the site is extracting the features and configuration options from the unstructured PDF documents that manufacturers provide to describe their vehicles. Much of the actual data sits in tables, where position and symbolic characters carry the critical information. We will discuss the different approaches available and what ultimately worked for the common but more difficult case of semi-structured (tabular) PDF data.
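To make the tabular case concrete, here is a minimal Python sketch of rebuilding a feature table from positioned text fragments. It is not the system described in this talk. It assumes that some PDF extraction step has already produced (x, y, text) tuples, that the first row is a header naming the trim levels, and that cells use a hypothetical legend ("S" for standard, "O" for optional, "-" for not available). The function names, the legend, and the sample values are all illustrative.

```python
# Hypothetical legend for the symbols found in a manufacturer's spec table.
SYMBOLS = {"S": "standard", "O": "optional", "-": "not available"}


def group_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, each sorted left to right."""
    rows = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        # Fragments whose y is close to the row's first fragment share a row.
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [sorted(row["cells"]) for row in rows]


def table_to_records(fragments):
    """Turn a positioned table into one record per (feature, trim) cell."""
    header, *body = group_rows(fragments)
    columns = header[1:]  # skip the label above the feature-name column
    records = []
    for cells in body:
        feature = cells[0][1]
        for x, text in cells[1:]:
            # Assign the cell to the trim whose header is horizontally nearest,
            # so a missing cell does not shift the rest of the row.
            trim = min(columns, key=lambda col: abs(col[0] - x))[1]
            records.append({
                "feature": feature,
                "trim": trim,
                "availability": SYMBOLS.get(text, text),
            })
    return records


# Illustrative input only; coordinates and names are made up.
fragments = [
    (10, 100, "Feature"), (200, 100, "Base"), (300, 100, "Sport"),
    (10, 120, "Sunroof"), (201, 121, "-"), (299, 120, "S"),
    (10, 140, "Heated seats"), (300, 140, "O"),
]
records = table_to_records(fragments)
```

Matching cells to the nearest header by x position, rather than by their order within the row, is what lets the sketch cope with empty cells. Real documents would also need handling for wrapped cell text and multi-line headers.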

There are many options available for doing entity recognition. We will discuss our experience with the Idibon SaaS offering we used. An important consideration was how we had to subdivide the problem to take advantage of any of the available libraries, including the one we ultimately used. We will describe how we structured the plain-text descriptive data and fed it to the entity resolution service to get predictions of the features the text represents. This allowed us to build a solution that is resilient to variations in how features are described.
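The idea of mapping varied descriptions onto a fixed set of features can be illustrated with a small stand-in. The Python sketch below does not use Idibon's service or API, neither of which is described here. It substitutes a simple token-overlap (Jaccard) score against a hypothetical catalog of canonical features, purely to show the input and output shape of such a prediction step. The catalog, the feature identifiers, and the function names are assumptions.

```python
import re

# Hypothetical catalog: canonical feature IDs with a few known phrasings.
CATALOG = {
    "heated_front_seats": ["heated front seats", "front seat heaters"],
    "power_moonroof": ["power moonroof", "power sunroof", "electric glass moonroof"],
}


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def predict_feature(description, catalog=CATALOG):
    """Return (feature_id, score) for the phrasing with the best token overlap.

    A stand-in for a real entity resolution service: the score is the Jaccard
    similarity between the description's tokens and a known phrasing's tokens.
    """
    words = tokens(description)
    best = (None, 0.0)
    for feature_id, phrasings in catalog.items():
        for phrase in phrasings:
            reference = tokens(phrase)
            score = len(words & reference) / len(words | reference)
            if score > best[1]:
                best = (feature_id, score)
    return best


# Differently worded descriptions resolve to the same canonical feature.
first = predict_feature("Heated seats, front")
second = predict_feature("Power tilt/slide moonroof")
```

Returning a score alongside the prediction lets a downstream step decide which results to accept automatically and which to route to a human reviewer.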

Finally, we will discuss how the output of the analytical system was delivered to end users to speed up the task of defining new models. As Garry Kasparov and others have noted, the combination of humans and machines is typically more capable than either alone. Systems that empower human experts rather than attempting to replace them can be very powerful indeed, and this is one such case study.

John Akred

Silicon Valley Data Science

With over 15 years in advanced analytical applications and architecture, John is dedicated to helping organizations become more data-driven. He combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Karim Qazi

Karim Qazi is an accomplished software engineer and technical leader with extensive experience in using Agile and test-driven development best practices to build automated, fault-tolerant, and highly available software.