Real estate databases are geospecific (e.g., East Bay, North Bay, South Bay, etc.). If a house to be put up for sale is located close to a geoboundary, the listing agent will often list it in both adjacent databases. For example, a house located in Milpitas, CA, would often be listed in both the East Bay and South Bay databases, although the content of the two entries may differ to appeal to the different demographics of each area. Real estate brokerage firms enter into cross-area sharing agreements, and there are efforts underway to create a nationwide sharing framework as well. Herein lies the problem: when data feeds from the East Bay and South Bay databases are aggregated, the result is duplicate listings.
Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries by analyzing the images that accompany real estate listings. The approach leverages a transfer learning Siamese architecture based on the VGG-16 CNN topology in TensorFlow 1.2. The curated dataset of over 3,000 images, provided by MLS Listings, Inc., consists of images of the fronts of houses and contains entries (sets of JPEG images) that are known a priori to belong to duplicate real estate listings as well as entries that are known to be distinct. Before embarking on building a convolutional neural network, Sergey, Olga, and her team attempted a brute-force approach using a 1-nearest neighbor algorithm to establish a baseline for the accuracy and precision of the prediction. Sergey and Olga explain why the brute-force nearest-neighbor approach was inadequate and how the CNN Siamese network was able to achieve an accuracy of 69% and a precision of 92%. To demonstrate that the implementation scales well as the dataset grows, they also describe an implementation of the same CNN Siamese network in Spark’s BigDL framework and compare the results with those of TensorFlow.
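As a rough illustration of what a brute-force 1-nearest neighbor baseline and its accuracy and precision metrics might look like, here is a minimal sketch in plain Python. It is not the speakers' implementation: the three-element feature vectors, the listing IDs (`eb-*`, `sb-*`), the Euclidean distance, the `threshold` value, and all function names are assumptions made up for this example, since the abstract does not say how images were represented or compared.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, candidates):
    """Return (candidate_id, distance) for the candidate closest to query."""
    best_id = min(candidates, key=lambda cid: euclidean(query, candidates[cid]))
    return best_id, euclidean(query, candidates[best_id])

def predict_duplicates(feed_a, feed_b, threshold):
    """For each listing in feed_a, predict its duplicate in feed_b (or None).

    A listing is flagged as a duplicate of its single nearest neighbor
    only when that neighbor lies within the distance threshold.
    """
    predictions = {}
    for a_id, vector in feed_a.items():
        b_id, distance = nearest_neighbor(vector, feed_b)
        predictions[a_id] = b_id if distance <= threshold else None
    return predictions

def accuracy_and_precision(predictions, truth):
    """Accuracy over all listings; precision over predicted duplicates only."""
    correct = true_pos = false_pos = 0
    for a_id, predicted in predictions.items():
        actual = truth.get(a_id)
        if predicted == actual:
            correct += 1
        if predicted is not None:
            if predicted == actual:
                true_pos += 1
            else:
                false_pos += 1
    accuracy = correct / len(predictions)
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    return accuracy, precision

# Hypothetical toy feature vectors standing in for image descriptors.
east_bay = {
    "eb-1": [0.10, 0.90, 0.30],
    "eb-2": [0.80, 0.20, 0.50],
    "eb-3": [0.40, 0.40, 0.90],
}
south_bay = {
    "sb-1": [0.12, 0.88, 0.31],
    "sb-2": [0.70, 0.10, 0.10],
    "sb-3": [0.45, 0.45, 0.80],
}
# Hypothetical ground truth: only eb-1 has a real duplicate.
truth = {"eb-1": "sb-1", "eb-2": None, "eb-3": None}

predictions = predict_duplicates(east_bay, south_bay, threshold=0.15)
accuracy, precision = accuracy_and_precision(predictions, truth)
print(predictions, accuracy, precision)
```

On this toy data the baseline finds the one true duplicate but also flags `eb-3` as a duplicate of the merely similar `sb-3`, so accuracy is 2/3 and precision is 1/2. That kind of false positive is one plausible way a raw nearest-neighbor comparison can fall short.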
Sergey Ermolin is a software solutions architect for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.
Olga Ermolin is a senior business intelligence engineer at MLS Listings, where she is responsible for standardizing the schema of the company’s real estate database across multiple real estate hosting companies as well as maintaining day-to-day data integrity and scalability. She created the company’s BI product that enables clients to visualize and analyze real estate trends and performance.
©2018, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.