Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Using Siamese CNNs for removing duplicate entries from real estate listing databases

Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)
16:35–17:15 Wednesday, 23 May 2018
Big data and data science in the cloud, Data science and machine learning
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions, Media, Advertising, Entertainment

Who is this presentation for?

  • Engineers, technical managers, and project managers

Prerequisite knowledge

  • Basic knowledge of feed-forward CNN and distributed computing principles

What you'll learn

  • Learn how to apply CNNs for image similarity tasks in real estate


Real estate databases are geospecific (e.g., East Bay, North Bay, and South Bay). If a house being put up for sale is located close to a geoboundary, the listing agent will often list it in both adjacent databases. For example, a house in Milpitas, CA, would often be listed in both the East Bay and South Bay databases, although the content of the two entries may differ to appeal to the demographics of each area. Real estate brokerage firms enter into cross-area sharing agreements, and efforts are underway to create a nationwide sharing framework as well. Herein lies the problem: aggregating data feeds from the East Bay and South Bay databases produces duplicate listings.

Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries by analyzing the images that accompany real estate listings. The approach uses a transfer learning Siamese architecture based on the VGG-16 CNN topology in TensorFlow 1.2. The curated dataset of over 3,000 images, provided by MLS Listings, Inc., includes images of the fronts of the houses and contains entries (sets of JPEG images) that are known a priori to belong to duplicate real estate listings as well as entries that are distinct. Before building a convolutional neural network, Sergey, Olga, and her team attempted a brute-force approach using a 1-nearest neighbor algorithm to establish a baseline for the accuracy and precision of the prediction. Sergey and Olga explain why the brute-force nearest-neighbor approach was inadequate and how the CNN Siamese network achieved an accuracy of 69% and a precision of 92%. To demonstrate that the implementation scales well as the dataset grows, they also describe an implementation of the same CNN Siamese network in Spark’s BigDL framework and compare the results with those of TensorFlow.
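
As a rough illustration of the decision stage of a Siamese comparison, the sketch below takes two embedding vectors (standing in for the outputs of a shared-weight CNN encoder, which is not shown), flags the pair as a duplicate when the embeddings are closer than a threshold, and then computes accuracy and precision. It is a minimal sketch and not the speakers' implementation: the function names, the Euclidean distance, the threshold value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_duplicate(emb_a, emb_b, threshold):
    """Flag a pair as duplicate when its embeddings are closer than threshold."""
    return euclidean(emb_a, emb_b) < threshold

def accuracy_and_precision(predictions, labels):
    """Return (accuracy, precision) for boolean predictions vs. ground truth."""
    tp = sum(p and l for p, l in zip(predictions, labels))      # true positives
    fp = sum(p and not l for p, l in zip(predictions, labels))  # false positives
    correct = sum(p == l for p, l in zip(predictions, labels))
    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Toy, made-up embeddings (hypothetical; real ones would come from the CNN).
# Each entry: (embedding of image A, embedding of image B, is_duplicate).
pairs = [
    ([0.1, 0.9], [0.12, 0.88], True),   # duplicate, embeddings close
    ([0.1, 0.9], [0.8, 0.2], False),    # distinct, embeddings far apart
    ([0.5, 0.5], [0.9, 0.1], True),     # duplicate, but far apart (missed)
    ([0.3, 0.3], [0.7, 0.9], False),    # distinct, embeddings far apart
]
THRESHOLD = 0.2  # assumed value; in practice tuned on held-out pairs

preds = [predict_duplicate(a, b, THRESHOLD) for a, b, _ in pairs]
labels = [is_dup for _, _, is_dup in pairs]
acc, prec = accuracy_and_precision(preds, labels)
print(f"accuracy={acc:.2f} precision={prec:.2f}")
```

In this toy data the classifier misses one true duplicate but never flags a distinct pair, so precision is higher than accuracy. That is one way a gap between the two metrics can arise; it is not a claim about why the reported figures differ.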

Sergey Ermolin


Sergey Ermolin is a software solutions architect for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.

Olga Ermolin

MLS Listings

Olga Ermolin is a senior business intelligence engineer at MLS Listings, where she is responsible for standardizing the schema of the company’s real estate database across multiple real estate hosting companies as well as maintaining day-to-day data integrity and scalability. She created the company’s BI product and designs and develops business intelligence tools that enable clients to visualize and analyze real estate trends and performance.