Fuzzy matching and deduplicating data - techniques for advanced data prep
Who is this presentation for?
Data engineers, data scientists, analysts
Machine learning transforms can enable you to identify duplicate or linked records in your dataset, even when the records do not have a common unique identifier and no fields match exactly. Machine learning transforms can help you with the following problems:
• Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (e.g. different name spelling, address differences)
• Matching external product lists against your product catalog, such as lists of hazardous goods or lists of goods that can’t be transported by air
• Deduplicating customer accounts, when the same person makes multiple registrations
In this session, you will learn how to use machine learning transforms to find matching records between the Abt.com and Buy.com lists of consumer electronics products in the Abt-Buy dataset. The two lists overlap significantly, but neither the IDs nor most of the names or descriptions match exactly.
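To make the matching problem concrete, here is a minimal sketch in Python. It is not the machine learning transform covered in the session; it is a plain string-similarity baseline built on the standard-library difflib module, and the function names, the 0.6 threshold, and the sample rows are all assumptions made up for illustration (the rows are not taken from the Abt-Buy dataset).

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left, right, threshold=0.6):
    """For each (id, name) in left, keep the best-scoring record in right
    if its score reaches the (hypothetical) threshold."""
    matches = []
    for l_id, l_name in left:
        best_id, best_score = None, 0.0
        for r_id, r_name in right:
            score = similarity(l_name, r_name)
            if score > best_score:
                best_id, best_score = r_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((l_id, best_id, round(best_score, 2)))
    return matches


# Invented sample rows for illustration only.
list_a = [
    ("a1", "Sony Bravia 40-inch LCD HDTV, Black"),
    ("a2", "Canon PowerShot Digital Camera 10MP"),
]
list_b = [
    ("b7", "Canon Power Shot 10 MP digital camera"),
    ("b9", "SONY BRAVIA 40in LCD HD TV (black)"),
]

print(match_records(list_a, list_b))
```

A baseline like this compares every pair of records and looks at a single text field, so it scales poorly and misses matches that depend on several fields at once. Those are the gaps a learned matching transform is meant to close.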
Prerequisite knowledge
Understanding of data preparation and ETL
What you'll learn
Amazon Web Services
Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.
Amazon Web Services
Roy Hasson is a Sr Manager of Global Business Development for Analytics and Data Lakes at Amazon Web Services, where he helps transform organizations using data. Roy serves as an expert resource on big data architectures, data lakes and machine learning. Previously at AWS, Roy served as a Technical Account Manager leading strategy and supporting implementation of data architectures with select customers. Prior to AWS, Roy spent 15 years working with tier 1 service providers to design and deploy large data and telephone network systems.
View a complete list of Strata Data Conference contacts