Fuzzy matching and deduplicating data: Techniques for advanced data prep
Who is this presentation for?
- Data engineers, data scientists, and analysts
Machine learning transforms can identify duplicate or linked records in your dataset, even when the records share no unique identifier and no fields match exactly. They can help with problems such as the following:
- Linking customer records across different customer databases, even when many customer fields don’t match exactly across the databases (e.g. different name spelling, address differences)
- Matching external product lists against your product catalog, such as lists of hazardous goods or lists of goods that can’t be transported by air
- Deduplicating customer accounts, when the same person makes multiple registrations
Nikki Rouda and Janisha Anand explain how they used machine learning transforms to find matching records between Abt.com’s and Buy.com’s lists of consumer electronics products, both taken from the Abt-Buy dataset. The two lists overlap significantly, but neither the IDs nor most of the names and descriptions match exactly.
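To make the matching problem concrete, here is a minimal sketch of fuzzy matching between two product lists. It is not the machine learning transform described in the session: it swaps in plain string similarity from Python's standard `difflib` module. The helper names (`normalize`, `similarity`, `best_matches`), the 0.6 threshold, and the sample product names are all assumptions invented for illustration, not values from the Abt-Buy dataset.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces so superficial differences don't count."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_matches(left, right, threshold=0.6):
    """For each left name, keep the most similar right name if it meets the threshold."""
    matches = []
    for left_name in left:
        score, right_name = max((similarity(left_name, r), r) for r in right)
        if score >= threshold:
            matches.append((left_name, right_name, round(score, 2)))
    return matches


# Hypothetical product names, invented for illustration only.
store_a = [
    "Sony Bravia 40-Inch LCD TV (KDL-40V4100)",
    "Canon PowerShot SD1100 Digital Camera",
]
store_b = [
    "Canon Powershot SD1100 IS 8MP camera",
    "Sony KDL40V4100 Bravia 40in LCD Television",
    "Garmin Nuvi 260 GPS",
]

for a, b, score in best_matches(store_a, store_b):
    print(f"{score:.2f}  {a!r}  <->  {b!r}")
```

A fixed threshold on a single string score is brittle: it compares every pair of records and cannot weigh several fields against each other. A learned matcher trained on labeled example pairs is aimed at exactly those limits.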
Prerequisite knowledge
- A basic understanding of data preparation and ETL
What you'll learn
- Learn how to match and deduplicate similar, but not exactly matching, records across datasets and how to simplify data preparation and cleansing
Amazon Web Services
Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.
Janisha Anand is a senior business development manager for data lakes at AWS, where she focuses on designing, implementing, and architecting large-scale solutions in the areas of data management, data processing, data architecture, and data analytics.