Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Customer record deduplication using Spark and Reifier

Dave Chan (UBM Asia), Sonal Goyal (Nube)
4:50pm–5:30pm Thursday, 12/03/2015
Hadoop & Beyond
Location: 334-335
Level: Intermediate
Average rating: 4.00 (7 ratings)

Prerequisite Knowledge

Attendees will benefit from prior exposure to, or awareness of, Apache Spark. We will cover some basic features to aid understanding.


Dirty, duplicate data plagues businesses of all sizes. For accurate analytics and predictions, it is critical to keep the underlying data free of duplicates. However, business data suffers from omissions, typographical errors and other inconsistencies, which makes finding duplicates extremely hard. Comparing each record with every other record makes the problem quadratic in the number of records. Hard-to-define similarity rules, multiple domains and multiple languages also make entity resolution and record linkage extremely challenging.
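To make the quadratic cost concrete, here is a minimal sketch in plain Python (not Spark, and not Reifier's matching logic). The sample names are made up, and `difflib.SequenceMatcher` from the standard library stands in for a real fuzzy matcher purely for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative stand-in only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical customer names; the first two differ by a single typo.
records = ["Jonathan Smith", "Jonathon Smith", "Mei Ling Tan", "Rahul Verma"]

# Naive deduplication scores every record against every other record.
pairs = list(combinations(records, 2))
scored = sorted(((similarity(a, b), a, b) for a, b in pairs), reverse=True)
best = scored[0]  # the most similar pair

print(f"{len(records)} records -> {len(pairs)} comparisons")
print(f"1,000,000 records -> {pair_count(1_000_000):,} comparisons")
print(f"most similar pair: {best[1]!r} / {best[2]!r}")
```

Four records need only 6 comparisons, but a million records need 499,999,500,000, which is why exhaustive pairwise matching does not scale.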

In this session, we will explain UBM Asia’s use case for entity resolution and the challenges their data posed. To ensure accurate and efficient marketing, UBM Asia has created a deduplication solution that employs the Reifier fuzzy matching engine. We will discuss this solution and how it handles duplicate customer records using Apache Spark.

We will also discuss how Reifier leverages Spark’s machine learning, distributed and in-memory capabilities to create training data for matching, learn similarity and indexing rules from the training data, and apply that knowledge to cleanse, deduplicate and link records across different domains and entities.
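As a rough illustration of what indexing and similarity rules do, the sketch below groups records by a blocking key and compares only records within the same group. Everything here is an assumption for illustration: the abstract says Reifier learns such rules from training data and runs on Spark, whereas this plain-Python version hard-codes a hypothetical key (`blocking_key`), a hypothetical threshold of 0.85, and made-up records.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical indexing rule: first letter of the name plus the country."""
    return record["name"][:1].lower() + "|" + record["country"].lower()


def is_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Hypothetical similarity rule: name similarity at or above a fixed threshold."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def find_duplicates(records: list) -> tuple:
    """Return (matching id pairs, number of comparisons actually made)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    matches, compared = [], 0
    for block in blocks.values():
        # Only records that share a blocking key are compared with each other.
        for a, b in combinations(block, 2):
            compared += 1
            if is_match(a, b):
                matches.append((a["id"], b["id"]))
    return matches, compared


# Made-up sample data.
records = [
    {"id": 1, "name": "Jonathan Smith", "country": "SG"},
    {"id": 2, "name": "Jonathon Smith", "country": "SG"},
    {"id": 3, "name": "Mei Ling Tan", "country": "SG"},
    {"id": 4, "name": "Rahul Verma", "country": "IN"},
    {"id": 5, "name": "Rahul Varma", "country": "IN"},
]

matches, compared = find_duplicates(records)
print(matches, compared)
```

With these five records, blocking cuts the work from 10 all-pairs comparisons to 2. The trade-off is that a poorly chosen key can place true duplicates in different blocks, which is one reason learning the rules from data is attractive.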


Dave Chan

UBM Asia

Dave Chan, CBIP, is a business analytics practitioner with over a decade of experience implementing big data projects for retail banking, healthcare and media organizations. In his current role, Dave leads a team of analysts that help the business democratize data and build data-driven products.
Dave speaks at big data, predictive analytics and data visualization conferences in Asia and the US. He graduated from the University of Illinois at Urbana-Champaign as a James Scholar in Electrical Engineering.


Sonal Goyal


Sonal is the founder and CEO of Nube Technologies, a startup focussed on big data preparation and analytics. Nube Technologies builds business applications for better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data.

By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, strengthens security and risk management, and improves the consolidation and reporting of business data. We help our customers build better, more effective models by ensuring that their underlying master data is accurate.

Comments on this page are now closed.


Sonal Goyal
11/18/2015 12:05am +08

Looking forward to interacting with the best minds – feel free to post here if you want me to cover something specific.