Real estate databases are geo-specific (e.g., East Bay, North Bay, South Bay). If a house being put up for sale is located close to a geographic boundary, the listing agent will often list it in both neighboring databases. For example, a house located in Milpitas would often be listed in both the East Bay and South Bay databases. The content of the two entries may differ in order to appeal to the demographics of each area. Real estate brokerage firms do enter into cross-area sharing agreements, and efforts are underway to create a nationwide sharing framework as well. Herein lies the problem: when data feeds from the East Bay and South Bay databases are aggregated, the same house appears as two separate listings. The purpose of the project is to provide a means of identifying and flagging these duplicates for future removal.
Olga Ermolin details an approach for identifying duplicate entries by analyzing the images that accompany real estate listings. The approach applies transfer learning in a Siamese architecture based on the VGG-16 CNN topology, implemented in TensorFlow 1.2. The curated dataset of more than 3,000 images, provided by MLS Listings, Inc., consists of photos of the fronts of houses. It contains entries (sets of JPEG images) that are known a priori to belong to duplicate real estate listings as well as entries that are known to be distinct. Before embarking on building a convolutional neural network, Olga and her team attempted a brute-force approach using a 1-nearest-neighbor algorithm to establish a baseline for the accuracy and precision of the prediction. Olga explains why the brute-force nearest-neighbor approach was inadequate and how the CNN Siamese network was able to achieve an accuracy of 69% and a precision of 92%. To demonstrate that the implementation scales well as the dataset grows, Olga also describes an implementation of the same CNN Siamese network in Spark’s BigDL framework and compares the results with those of TensorFlow.
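For readers unfamiliar with the baseline, the following is a minimal sketch of what a brute-force 1-nearest-neighbor duplicate check and its accuracy and precision scores could look like. It is not the team's code: the function names, the Euclidean distance metric, the distance threshold, and the tiny three-value "image" vectors are all assumptions made for illustration, and a real pipeline would compare full-size JPEG pixel data or learned features instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, candidates):
    """Return (index, distance) of the candidate closest to the query."""
    best_index, best_distance = -1, float("inf")
    for i, candidate in enumerate(candidates):
        d = euclidean(query, candidate)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance


def flag_duplicates(queries, candidates, threshold):
    """Flag a query as a duplicate when its nearest candidate is within threshold.

    The threshold is a hypothetical tuning parameter, not a value from the talk.
    """
    return [nearest_neighbor(q, candidates)[1] <= threshold for q in queries]


def accuracy_and_precision(predicted, actual):
    """Accuracy = correct / total; precision = true positives / flagged."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Toy stand-ins for flattened image vectors from two regional feeds (made up).
east_bay = [[0.10, 0.20, 0.90], [0.80, 0.70, 0.10]]
south_bay = [[0.12, 0.19, 0.88], [0.40, 0.40, 0.40], [0.50, 0.90, 0.30]]
# Hypothetical ground truth: is each South Bay entry a duplicate of an East Bay one?
known_duplicates = [True, False, True]

flags = flag_duplicates(south_bay, east_bay, threshold=0.1)
accuracy, precision = accuracy_and_precision(flags, known_duplicates)
print(flags, accuracy, precision)
```

In this toy data the third South Bay vector is labeled a duplicate but lies far from every East Bay vector, so a raw distance comparison misses it. That is only an illustration of how such a baseline can fall short, not a statement of the reasons given in the session.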
Olga Ermolin is a senior business intelligence engineer at MLS Listings, where she is responsible for standardizing the schema of the company’s real estate database across multiple real estate hosting companies as well as maintaining day-to-day data integrity and scalability. She created the company’s BI product that enables clients to visualize and analyze real estate trends and performance.
©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org