Sep 23–26, 2019

Fuzzy matching and deduplicating data: Techniques for advanced data prep

Nikki Rouda (Amazon Web Services), Roy Hasson (Amazon Web Services)
2:05pm–2:45pm Thursday, September 26, 2019
Location: 1E 07/08

Who is this presentation for?

Data engineers, data scientists, and analysts




Machine learning transforms can enable you to identify duplicate or linked records in your dataset, even when the records do not have a common unique identifier and no fields match exactly. Machine learning transforms can help you with the following problems:
• Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (e.g. different name spelling, address differences)
• Matching external product lists against your product catalog, such as lists of hazardous goods or lists of goods that can’t be transported by air
• Deduplicating customer accounts, when the same person makes multiple registrations
In this session, you will learn how to use machine learning transforms to find matching records between the two companies' lists of consumer electronics products in the Abt-Buy dataset. These lists overlap significantly in the products they cover, but neither the IDs nor most of the names or descriptions match exactly.
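
To make the record-linking idea concrete, here is a minimal sketch in Python. It is not the machine learning transform covered in the session: it swaps in plain string similarity from the standard library (difflib.SequenceMatcher) to show what linking records without a shared ID looks like. The product records, the helper names (normalize, similarity, link_records), and the 0.6 threshold are all hypothetical and chosen for illustration; none of them come from the Abt-Buy dataset.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace punctuation with spaces, then collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two product names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.6):
    """Link each left record to its most similar right record.

    A link is kept only if its score reaches the threshold
    (0.6 is an arbitrary illustrative value, not a recommendation).
    """
    links = []
    for left_id, left_name in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_name in right.items():
            score = similarity(left_name, right_name)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links.append((left_id, best_id, round(best_score, 2)))
    return links


# Made-up records: the two lists share no IDs and no name matches exactly.
catalog_a = {
    "A1": "Sony Cyber-shot DSC-W120 Digital Camera - Black",
    "A2": 'Panasonic 42" Viera Plasma HDTV',
}
catalog_b = {
    "B7": "Panasonic Viera 42 inch plasma HDTV",
    "B9": "Sony Cybershot DSC W120 digital camera (black)",
    "B3": "Garmin nuvi 260 GPS navigator",
}

links = link_records(catalog_a, catalog_b)
for left_id, right_id, score in links:
    print(left_id, "->", right_id, score)
```

A pairwise comparison like this grows with the product of the two list sizes and depends on a hand-picked threshold, which is part of why a learned matching approach is attractive for larger or messier data.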

Prerequisite knowledge

Understanding of data preparation and ETL

What you'll learn

• How to match and deduplicate similar, but not exactly matching, records across datasets
• How to simplify data preparation and cleansing

Nikki Rouda

Amazon Web Services

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and a UK-based early consumer IoT startup. Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Roy Hasson

Amazon Web Services

Roy Hasson is a Sr Manager of Global Business Development for Analytics and Data Lakes at Amazon Web Services, where he helps transform organizations using data. Roy serves as an expert resource on big data architectures, data lakes and machine learning. Previously at AWS, Roy served as a Technical Account Manager leading strategy and supporting implementation of data architectures with select customers. Prior to AWS, Roy spent 15 years working with tier 1 service providers to design and deploy large data and telephone network systems.
