Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

New directions in record linkage

Yves Thibaudeau (US Census Bureau)
4:40pm-5:20pm Thursday, March 28, 2019
Average rating: 3.33 out of 5 (3 ratings)

Level

Intermediate

Prerequisite knowledge

  • Basic knowledge of record linkage
  • Familiarity with supervised and unsupervised machine learning

What you'll learn

  • Understand what can be accomplished by record linkage packages and tools
  • Explore issues of speed and functionality in various modern environments and the challenge of specifying error bounds for linked records and missed links

Description

The US Census Bureau has been involved in record linkage (a.k.a. entity resolution) projects for over 40 years. In that time, computing capabilities have changed considerably and new techniques have emerged, along with important developments in machine learning algorithms and data science that support and improve record linkage processes. The Census Bureau is reviewing an inventory of linkage methodologies, including multiple homegrown methods and software packages, as it embarks on ever more challenging record linkage projects.

Yves Thibaudeau describes the progress made so far in identifying which record linkage techniques suit which applications and offers an overview of solutions, such as the homegrown linkage software BigMatch, which implements multikey quicksort on character strings and is believed to be among the fastest linkage software available. (BigMatch is written in C, a low-level compiled language, so it is expected to run very efficiently with minimal compilation and translation overhead.) Other packages under review include other Census Bureau software written in SAS and C, as well as the Python Record Linkage Toolkit. Yves details the strengths and weaknesses of each and identifies which are most effective for the Census Bureau’s various record linkage applications and its mission. Along the way, he covers issues of speed and functionality in various modern environments (linking business lists, census rosters, etc.), as well as the difficult issue of specifying and estimating error bounds for the linked records and missed links.
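
For readers unfamiliar with the multikey quicksort mentioned above, here is a minimal Python sketch of the general technique (three-way partitioning on one character position at a time). It is not BigMatch's implementation, which is written in C and is not shown in this abstract. The function names and the sample name strings are hypothetical, chosen only for illustration.

```python
def _char_at(s, d):
    # Return the code point at position d, or -1 past the end of the string,
    # so that a string sorts before any longer string it is a prefix of.
    return ord(s[d]) if d < len(s) else -1


def _sort(a, lo, hi, d):
    # Sort a[lo..hi] in place, assuming all strings agree on their first d characters.
    if lo >= hi:
        return
    pivot = _char_at(a[lo], d)
    lt, gt, i = lo, hi, lo + 1
    while i <= gt:
        c = _char_at(a[i], d)
        if c < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif c > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    # Now a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot at position d.
    _sort(a, lo, lt - 1, d)
    if pivot >= 0:
        # Strings in the middle group share this character; move to the next one.
        _sort(a, lt, gt, d + 1)
    _sort(a, gt + 1, hi, d)


def multikey_quicksort(strings):
    # Return a new list sorted by code point order; the input is left unchanged.
    items = list(strings)
    _sort(items, 0, len(items) - 1, 0)
    return items


# Hypothetical sample keys, for illustration only.
names = ["SMITH JOHN", "SMYTHE JON", "SMITH JOAN", "SMITH", "JONES MARY"]
print(multikey_quicksort(names))
```

Because the sort examines one character position at a time, strings that share long prefixes (common for names and addresses) are not compared from the start on every comparison. In a linkage workflow, sorted keys of this kind are typically used to bring candidate records next to each other before comparing them.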

Yves Thibaudeau

US Census Bureau

Yves Thibaudeau is a mathematical statistician and principal researcher at the US Census Bureau. His publications include a book chapter on record linkage and computer matching in the Annals of Applied Statistics. He’s given a number of presentations on statistics and record linkage at conferences since 1988. Yves holds a PhD and an MS in statistics from Carnegie Mellon and a BSc in mathematics from McGill.