Linking Data Without Common Identifiers

Data Science
Location: 115
Average rating: 4.73 (11 ratings)

Finding different records that represent the same real-world entity is difficult, but also very useful. All non-trivial databases contain duplicates, and these inevitably affect the quality of analysis and reporting. Similarly, one often wants to connect records across different data sets in order to enrich the data. For this reason, statisticians started developing techniques for linking data as early as the 1940s, and by now this is a mature, well-established field with many useful techniques and tools.

This talk explains the basics of record linkage (as it’s called): cleaning input data, fuzzy string comparators such as Levenshtein, affine gap, and Jaro-Winkler, and much more. It then introduces Duke, an open source Java tool that implements this approach using Bayesian statistics, and shows how it can be used to solve real-world problems like deduplicating customer databases and connecting data sets.
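To make the comparator idea concrete, here is a minimal Java sketch of the Levenshtein edit distance, plus the kind of pairwise Bayesian combination of per-field match probabilities the talk alludes to. The class and method names are illustrative assumptions, not Duke's actual API, and the combination formula is a standard naive-Bayes style update, not a verified excerpt from Duke.

```java
// Illustrative sketch only; not Duke's real API.
public class LinkageSketch {

    // Levenshtein distance: the number of single-character insertions,
    // deletions, and substitutions needed to turn s into t.
    public static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++)
            prev[j] = j;                       // distance from empty prefix
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;         // reuse rows
        }
        return prev[t.length()];
    }

    // Naive-Bayes style combination of two per-field match probabilities
    // into an overall probability that the records are the same entity
    // (assumed formula, in the spirit of the Bayesian approach described).
    public static double combine(double p1, double p2) {
        return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2));
    }

    public static void main(String[] args) {
        System.out.println(distance("Smith", "Smyth"));  // 1
        System.out.println(combine(0.9, 0.8));
    }
}
```

For example, two name fields that each individually suggest a 0.9 and 0.8 probability of a match combine to a noticeably higher overall probability, which is why even weak per-field evidence can add up across several fields.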

At the end there will be a brief demo of how genetic algorithms with active learning can be used to automatically create linking configurations.


Lars Marius Garshol


I’m a consultant, switching between the roles of developer, architect, and advisor, and focusing mostly on semantic technology and data integration. I’m looking at getting into Big Data analytics to derive more value from the data we’ve integrated, and I’m the developer of an open source data linking tool called Duke.