Dirty, duplicate data plagues businesses of all sizes. For accurate analytics and predictions, it is critical to keep the underlying data free of duplicates. However, business data suffers from errors, omissions and typographical mistakes, which makes finding duplicates extremely hard. Comparing each record with every other record makes the problem quadratic in the number of records. Similarity rules that are hard to define, along with multiple domains and languages, make entity resolution and record linkage even more challenging.
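To make the quadratic cost concrete, here is a minimal sketch in plain Python. It is not how Reifier works internally; the sample records, the choice of `city` as a blocking key, and the 0.8 similarity cutoff are all assumptions made for illustration. It compares the number of candidate pairs produced by exhaustive comparison with the number left after a simple blocking step, then scores the remaining pairs with a basic string similarity from the standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records; names and fields are illustrative only.
records = [
    {"id": 1, "name": "Acme Trading Ltd", "city": "Hong Kong"},
    {"id": 2, "name": "ACME Trading Limited", "city": "Hong Kong"},
    {"id": 3, "name": "Blue Ocean Exports", "city": "Singapore"},
    {"id": 4, "name": "Blue Ocaen Exports", "city": "Singapore"},
    {"id": 5, "name": "Sunrise Textiles", "city": "Mumbai"},
]


def all_pairs(recs):
    """Exhaustive comparison: n * (n - 1) / 2 candidate pairs."""
    return list(combinations(recs, 2))


def blocked_pairs(recs, key):
    """Only pair up records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[key(rec)].append(rec)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def similarity(a, b):
    """Case-insensitive string similarity of the two names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


naive = all_pairs(records)
blocked = blocked_pairs(records, key=lambda r: r["city"])

# Assumed cutoff of 0.8; a real system would tune or learn this value.
matches = [(a["id"], b["id"]) for a, b in blocked if similarity(a, b) >= 0.8]

print(len(naive), "pairs without blocking")
print(len(blocked), "pairs with blocking on city")
print("likely duplicates:", matches)
```

Blocking trades recall for speed: two duplicates whose city fields disagree would never be compared, which is one reason hand-written indexing rules are hard to get right.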
In this session, we will explain UBM Asia’s use case for entity resolution and the challenges their data posed. To ensure accurate and efficient marketing, UBM Asia has created a deduplication solution employing the Reifier fuzzy matching engine. We will discuss this solution and how it handles duplicate customer records using Apache Spark.
We will also discuss how Reifier leverages Spark’s machine learning, distributed and in-memory capabilities to create training data for matching, learn similarity and indexing rules from the training data, and apply that knowledge to cleanse, deduplicate and link records across different domains and entities.
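As a rough illustration of learning a matching rule from training data, the sketch below picks a similarity threshold from a handful of labeled pairs. This is an assumption-laden toy, not Reifier's algorithm and not Spark code: the labeled pairs, the `name_similarity` and `learn_threshold` helpers, and the use of a single string-similarity feature are all hypothetical, and it runs on the Python standard library so it can be tried without a cluster.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical labeled training pairs: (name_a, name_b, is_duplicate).
training_pairs = [
    ("Acme Trading Ltd", "ACME Trading Limited", True),
    ("Blue Ocean Exports", "Blue Ocaen Exports", True),
    ("Sunrise Textiles", "Sunrise Textile Co", True),
    ("Acme Trading Ltd", "Apex Shipping Ltd", False),
    ("Blue Ocean Exports", "Sunrise Textiles", False),
    ("Golden Dragon Foods", "Silver Dragon Tools", False),
]


def learn_threshold(pairs):
    """Return (threshold, training accuracy) for the rule score >= threshold.

    Every observed score is tried as a candidate threshold, and the lowest
    one that classifies the most training pairs correctly is kept.
    """
    scored = [(name_similarity(a, b), label) for a, b, label in pairs]
    best_threshold, best_correct = 1.0, -1
    for candidate in sorted({score for score, _ in scored}):
        correct = sum((score >= candidate) == label for score, label in scored)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold, best_correct / len(scored)


threshold, accuracy = learn_threshold(training_pairs)


def is_duplicate(a, b):
    """Apply the learned rule to a new pair of names."""
    return name_similarity(a, b) >= threshold


print("learned threshold:", round(threshold, 3))
print("training accuracy:", accuracy)
```

A production matcher would combine many features (addresses, phone numbers, phonetic encodings) and a proper classifier, and would evaluate on held-out pairs rather than on the training set.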
Dave Chan, CBIP, is a business analytics practitioner with over a decade of experience implementing big data projects for retail banking, healthcare and media organizations. In his current role, Dave leads a team of analysts that help the business democratize data and build data-driven products.
Dave speaks at big data, predictive analytics and data visualization conferences in Asia and the US. He graduated from the University of Illinois at Urbana-Champaign as a James Scholar in Electrical Engineering.
Sonal is the founder and CEO at Nube Technologies (www.nubetech.co), a startup focussed on big data preparation and analytics. Nube Technologies builds business applications for better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data.
By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, enhance security and risk management, and improve consolidation and reporting of business data. We help our customers build better, more effective models by ensuring that their underlying master data is accurate.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.