Fuzzy matching and deduplicating data: Techniques for advanced data prep
Who is this presentation for?
- Data engineers, data scientists, and analysts
Description
Machine learning transforms let you identify duplicate or linked records in a dataset, even when the records share no unique identifier and no fields match exactly. They can help with problems such as:
- Linking customer records across different customer databases, even when many customer fields don’t match exactly across the databases (e.g., different name spellings or address differences)
- Matching external product lists against your product catalog, such as lists of hazardous goods or lists of goods that can’t be transported by air
- Deduplicating customer accounts, when the same person makes multiple registrations
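As a rough illustration of the deduplication problem, the sketch below groups near-duplicate customer records using plain string similarity from Python's standard library. It is not the machine learning transform described in this session. The field layout, the sample records, the `deduplicate` helper, and the 0.85 threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

def normalize(record):
    """Lowercase every field and collapse runs of whitespace."""
    return tuple(" ".join(str(v).lower().split()) for v in record)

def similarity(a, b):
    """Average per-field similarity ratio, between 0 and 1."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)

def deduplicate(records, threshold=0.85):
    """Group records whose pairwise similarity reaches the threshold.

    Uses a small union-find so that chains of similar records
    (A~B, B~C) end up in the same group.
    """
    norm = [normalize(r) for r in records]
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare every pair; fine for a demo, too slow for large datasets.
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if similarity(norm[i], norm[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())

# Invented sample data: (name, address)
customers = [
    ("Jon Smith", "12 Main St"),
    ("John Smith", "12 Main Street"),
    ("Maria Garcia", "98 Oak Ave"),
]
groups = deduplicate(customers)
for group in groups:
    print(group)
```

A fixed threshold on character similarity is brittle, which is part of the motivation for learning the matching rule from labeled examples instead.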
Nikki Rouda and Janisha Anand explain how they used machine learning transforms to find matching records in Abt.com’s and Buy.com’s lists of consumer electronics products from the Abt-Buy dataset. The two lists overlap significantly, but the IDs do not match, and most names and descriptions do not match exactly.
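To make the cross-list matching problem concrete, here is a minimal sketch that links two product lists by token overlap (Jaccard similarity). It is a simple baseline, not the presenters' method. The product names are invented for illustration and are not taken from the Abt-Buy dataset, and the `link` helper and its 0.5 threshold are assumptions.

```python
import re

def tokens(text):
    """Split a product name into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def link(left, right, threshold=0.5):
    """Map each left name to its best-scoring right name, or None
    when no candidate reaches the threshold."""
    right_tokens = [(name, tokens(name)) for name in right]
    links = {}
    for name in left:
        name_tokens = tokens(name)
        best_name, best_score = None, 0.0
        for candidate, candidate_tokens in right_tokens:
            score = jaccard(name_tokens, candidate_tokens)
            if score > best_score:
                best_name, best_score = candidate, score
        links[name] = best_name if best_score >= threshold else None
    return links

# Invented product names, for illustration only
store_a = [
    "Sony Bravia 40-inch LCD TV KDL40",
    "Canon PowerShot SD1100 Digital Camera",
    "Apple iPod Nano 8GB",
]
store_b = [
    "Canon SD1100 PowerShot camera, silver",
    "Sony KDL40 Bravia LCD television 40 inch",
    "Garmin Nuvi GPS",
]
links = link(store_a, store_b)
for source, match in links.items():
    print(source, "->", match)
```

Token overlap ignores word order, so it tolerates reshuffled names, but it still misses abbreviations and misspellings that a trained matcher could learn to handle.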
Prerequisite knowledge
- A basic understanding of data preparation and ETL
What you'll learn
- How to match and deduplicate similar, but not exactly matching, records across datasets, and how to simplify data preparation and cleansing
Nikki Rouda
Amazon Web Services
Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.
Janisha Anand
Amazon Web Services
Janisha Anand is a senior business development manager for data lakes at AWS, where she focuses on designing, implementing, and architecting large-scale solutions in the areas of data management, data processing, data architecture, and data analytics.