Sep 23–26, 2019

Fuzzy matching and deduplicating data: Techniques for advanced data prep

Nikki Rouda (Amazon Web Services), Janisha Anand (Amazon Web Services)
2:05pm–2:45pm Thursday, September 26, 2019
Location: 1E 07/08
Average rating: 1.71 (14 ratings)

Who is this presentation for?

  • Data engineers, data scientists, and analysts




Machine learning transforms can identify duplicate or linked records in your dataset, even when the records share no unique identifier and no fields match exactly. They can help with problems such as the following:

  • Linking customer records across different customer databases, even when many customer fields don’t match exactly across the databases (e.g., different name spellings or address differences)
  • Matching external product lists against your product catalog, such as lists of hazardous goods or lists of goods that can’t be transported by air
  • Deduplicating customer accounts, when the same person makes multiple registrations

Nikki Rouda and Janisha Anand explain how they used machine learning transforms to find matching records across the two retailers’ lists of consumer electronic products in the Abt-Buy dataset. The lists overlap significantly in the products they cover, but the IDs differ, and most names and descriptions do not match exactly either.
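
To make the matching problem concrete, here is a minimal sketch of linking two product lists by name similarity. It is not the machine learning transform the speakers describe; it swaps in plain string similarity from Python's standard `difflib` module as a stand-in. The function names, the 0.6 threshold, and the sample product names are all assumptions made up for illustration and are not taken from the Abt-Buy dataset.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.6):
    """Pair each left record with its most similar right record,
    keeping the pair only if the score reaches the (assumed) threshold."""
    links = []
    for left_id, left_name in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_name in right.items():
            score = similarity(left_name, right_name)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links.append((left_id, best_id, round(best_score, 2)))
    return links


# Hypothetical product names; the two lists use unrelated IDs.
store_a = {
    "a1": "Sony Cyber-shot DSC-W120 Digital Camera",
    "a2": "Apple iPod nano 8GB Silver",
}
store_b = {
    "b7": "Sony Cybershot DSC W120 7.2MP camera",
    "b9": "Bose QuietComfort 3 Headphones",
}

for left_id, right_id, score in link_records(store_a, store_b):
    print(left_id, right_id, score)
```

A comparison of every pair like this grows quadratically with list size and relies on a hand-picked threshold, which is part of why a learned matching approach is attractive for larger or messier datasets.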

Prerequisite knowledge

  • A basic understanding of data preparation and ETL

What you'll learn

  • How to match and deduplicate similar, but not exactly matching, records across datasets, and how to simplify data preparation and cleansing

Nikki Rouda

Amazon Web Services

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and a UK-based early consumer IoT startup. Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.


Janisha Anand

Amazon Web Services

Janisha Anand is a senior business development manager for data lakes at AWS, where she focuses on designing, implementing, and architecting large-scale solutions in the areas of data management, data processing, data architecture, and data analytics.
