Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Using Siamese CNNs for removing duplicate entries from real estate listing databases

Olga Ermolin (MLS Listings)
16:35–17:15 Wednesday, 23 May 2018
Big data and data science in the cloud, Data science and machine learning
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions, Media, Advertising, Entertainment

Who is this presentation for?

  • Engineers, technical managers, and project managers

Prerequisite knowledge

  • Basic knowledge of feed-forward CNNs and distributed computing principles

What you'll learn

  • How to apply CNNs to image similarity tasks in real estate


Real estate databases are geo-specific (e.g., East Bay, North Bay, South Bay). If a house being put up for sale is located close to a geographic boundary, the listing agent will often list it in both databases. For example, a house located in Milpitas would often be listed in both the East Bay and South Bay databases. The content of the two entries may differ in order to appeal to the different demographics of each area. Real estate brokerage firms do enter into cross-area sharing agreements, and there are efforts underway to create a nationwide sharing framework as well. Herein lies the problem: when data feeds from the East Bay and South Bay databases are aggregated, the same house appears as two duplicate listings. The purpose of the project is to provide a means to identify and flag these duplicates for future removal.

Olga Ermolin details an approach for identifying duplicate entries by analyzing the images that accompany real estate listings. The approach applies transfer learning in a Siamese architecture based on the VGG-16 CNN topology, implemented in TensorFlow 1.2. The curated dataset of over 3,000 images, provided by MLS Listings, Inc., consists of images of the fronts of houses and contains entries (sets of JPEG images) that are known a priori to belong to duplicate real estate listings as well as entries that are distinct. Before embarking on building a convolutional neural network, Olga and her team attempted a brute-force approach using a 1-nearest neighbor algorithm to establish a baseline for the accuracy and precision of the prediction. Olga explains why the brute-force nearest-neighbor approach was inadequate and how the Siamese CNN was able to achieve an accuracy of 69% and a precision of 92%. To demonstrate that the implementation scales well as the dataset grows, Olga also describes an implementation of the same Siamese CNN in the BigDL framework on Spark and compares the results with those of TensorFlow.
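As a rough illustration of the decision step in a Siamese setup, and of how accuracy and precision can diverge, here is a minimal Python sketch. It is not the speaker's implementation. It assumes each image has already been mapped to an embedding vector by the shared branch (VGG-16 in the talk), and that a pair is flagged as a duplicate when the Euclidean distance between its two embeddings falls below a threshold. The function names, the toy two-dimensional embeddings, and the threshold of 0.5 are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_duplicate(emb_a, emb_b, threshold):
    """Flag a pair as duplicate when its embeddings are closer than threshold.

    In a Siamese network both embeddings come from the same branch with
    shared weights; here they are simply passed in (assumption).
    """
    return euclidean(emb_a, emb_b) < threshold

def accuracy_and_precision(predictions, labels):
    """Pair-level accuracy and precision for boolean predictions and labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    tn = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    accuracy = (tp + tn) / len(labels)
    # Precision: of the pairs flagged as duplicates, the share that truly are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Hypothetical toy pairs: (embedding of image A, embedding of image B, is_duplicate)
toy_pairs = [
    ((0.0, 0.0), (0.1, 0.0), True),   # close, true duplicate
    ((1.0, 1.0), (1.0, 1.2), True),   # close, true duplicate
    ((0.0, 0.0), (0.9, 0.0), True),   # far apart, but a true duplicate (missed)
    ((0.0, 0.0), (3.0, 4.0), False),  # far apart, distinct
    ((2.0, 2.0), (2.0, 2.3), False),  # close, but distinct (false alarm)
]

THRESHOLD = 0.5  # hypothetical value; in practice tuned on held-out pairs

predictions = [predict_duplicate(a, b, THRESHOLD) for a, b, _ in toy_pairs]
labels = [label for _, _, label in toy_pairs]
accuracy, precision = accuracy_and_precision(predictions, labels)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Precision counts only the pairs that were flagged, so a model that flags few pairs but is usually right about them can show high precision alongside lower accuracy. Whether that explains the 92% and 69% figures reported above is not stated in the abstract.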

Olga Ermolin

MLS Listings

Olga Ermolin is a senior business intelligence engineer at MLS Listings, where she is responsible for standardizing the schema of the company’s real estate database across multiple real estate hosting companies as well as maintaining day-to-day data integrity and scalability. She created the company’s BI product that enables clients to visualize and analyze real estate trends and performance.
