The US Census Bureau has been involved in record linkage (a.k.a. entity resolution) projects for over 40 years. In that time, computing capabilities have changed substantially, and new techniques in machine learning and data science have emerged to support and improve record linkage processes. As it embarks on ever more challenging record linkage projects, the Census Bureau is reviewing an inventory of linkage methodologies, including multiple homegrown methods and software packages.
Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications and offers an overview of solutions, such as the homegrown linkage software BigMatch, which implements multikey quicksort on character strings and is believed to be among the fastest linkage software available. (BigMatch is written in C, a low-level compiled language, so it is expected to run very efficiently with minimal compilation and translation overhead.) Other packages under review include other Census Bureau software written in SAS and C as well as the Python Record Linkage Toolkit. Yves details the strengths and weaknesses of each and identifies which are most effective for the Census Bureau's multiple record linkage applications and its mission. Along the way, he covers issues of speed and functionality in various modern settings (linking business lists, census rosters, etc.), as well as the difficult issue of specifying and estimating error bounds for the linked records and missed links.
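For readers unfamiliar with multikey quicksort, the following is a minimal Python sketch of the textbook algorithm (three-way radix quicksort on strings). It is not BigMatch's code, which is written in C and is not shown in this abstract; the function names `multikey_quicksort`, `_sort`, and `_char_at` are hypothetical and chosen only for illustration. The idea is to partition the strings on one character position at a time, so strings sharing a long common prefix are not compared from the start over and over.

```python
def _char_at(s, d):
    # Return the code point at position d, or -1 past the end of the string,
    # so that a string sorts before any longer string it is a prefix of.
    return ord(s[d]) if d < len(s) else -1


def _sort(a, lo, hi, d):
    # Sort a[lo..hi] in place, assuming all entries agree on their first d characters.
    if hi <= lo:
        return
    lt, gt = lo, hi
    pivot = _char_at(a[lo], d)
    i = lo + 1
    while i <= gt:
        c = _char_at(a[i], d)
        if c < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif c > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    # a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot at position d.
    _sort(a, lo, lt - 1, d)
    if pivot >= 0:
        # The middle group shares character d, so move on to the next character.
        _sort(a, lt, gt, d + 1)
    _sort(a, gt + 1, hi, d)


def multikey_quicksort(strings):
    # Return a new list sorted by code point order; the input is left unchanged.
    items = list(strings)
    _sort(items, 0, len(items) - 1, 0)
    return items


if __name__ == "__main__":
    # Made-up surnames, purely illustrative.
    names = ["smyth", "smith", "jones", "smit", "johnson", "smith"]
    print(multikey_quicksort(names))
```

In a linkage setting, sorting string keys this way places records with identical or near-identical keys next to each other, which is one common way to narrow down candidate pairs. How BigMatch uses the sorted output is not described here.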
Yves Thibaudeau is a mathematical statistician and principal researcher at the US Census Bureau. His publications include a book chapter on record linkage and computer matching in the Annals of Applied Statistics. He’s given a number of presentations on statistics and record linkage at conferences since 1988. Yves holds a PhD and an MS in statistics from Carnegie Mellon and a BSc in mathematics from McGill.