Finding different records that represent the same real-world entity is difficult, but also very useful. All non-trivial databases contain duplicates, and these inevitably degrade the quality of analysis and reporting. It is also common to want to connect records across different data sets in order to enrich the data. For this reason, statisticians started creating techniques for linking data as early as the 1940s, and by now this is a mature and well-established field with many useful techniques and tools.
This talk explains the basics of record linkage (as it’s called), from cleaning input data to fuzzy string comparators such as Levenshtein, affine gaps, and Jaro-Winkler. It then introduces Duke, an open source Java tool that implements these techniques using Bayesian statistics, and shows how it can be used to solve real-world problems such as deduplicating customer databases and connecting data sets.
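To give a flavor of the approach, here is a minimal sketch of how a fuzzy comparator and a Bayesian combination step might fit together. It is written in Python for brevity and is not Duke's code or API. The helper names, the linear mapping from similarity to probability, and the per-field `(low, high)` bounds are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0..1 similarity (1.0 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def field_probability(sim: float, low: float, high: float) -> float:
    """Hypothetical linear mapping: sim 0 gives `low`, sim 1 gives `high`."""
    return low + sim * (high - low)


def combine(p1: float, p2: float) -> float:
    """Naive Bayes-style combination of two independent probabilities."""
    denominator = p1 * p2 + (1 - p1) * (1 - p2)
    if denominator == 0:
        return 0.5  # total contradiction (1 vs 0): fall back to "unknown"
    return (p1 * p2) / denominator


def match_probability(r1: dict, r2: dict, fields: dict) -> float:
    """Start from 0.5 (no evidence) and fold in one probability per field.
    `fields` maps a field name to its assumed (low, high) bounds."""
    probability = 0.5
    for name, (low, high) in fields.items():
        sim = similarity(r1[name], r2[name])
        probability = combine(probability, field_probability(sim, low, high))
    return probability
```

For example, `match_probability({"name": "Jon Smith"}, {"name": "John Smith"}, {"name": (0.3, 0.9)})` yields a value above 0.5, so the name field counts as evidence for a match. Pairs scoring above a chosen threshold would then be treated as the same entity.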
At the end there will be a brief demo of how genetic algorithms with active learning can be used to automatically create linking configurations.
I’m a consultant, switching between the roles of developer, architect, and advisor, focusing mostly on semantic technology and data integration. I’m looking at getting into Big Data analytics to derive more value from the data we’ve integrated. I’m also the developer of an open source data linking tool called Duke.
©2015, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.