In real estate, interest in predicting the “right price” for a property is growing rapidly. Most current algorithms use only statistical information about a property as input to predict its price. However, these algorithms omit a notable form of data that often influences a buyer's perception: images of the house. Fortunately, convolutional neural networks (CNNs) have recently risen to prominence for their ability to extract strong feature representations from images and map those representations accurately to scalar or vector outputs.
Yet, as the 2021 project explored, building a dataset of well-labelled pictures and corresponding statistical information for homes in the Greater Toronto Area (GTA), with houses sold at a single point in time, is a difficult task. Extensive hyperparameter tuning on a dataset of 500 houses yielded a MAPE of 15%. However, this figure is well behind current MAPE benchmarks achieved with numerical datasets alone. Furthermore, on the numerical side, high-quality GTA residential datasets that contain a wide variety of house attributes are few and far between.
The lack of high-quality GTA datasets led to the core goal of this sequel RealValue project: to create a dataset centered around GTA homes that is extensive, usable, and informative. Success toward this goal is measured by the MAPE that different experiments are able to achieve.
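For reference, MAPE (mean absolute percentage error) measures the average relative deviation of predicted prices from actual sale prices. A minimal sketch, assuming NumPy arrays of actual and predicted prices:

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, expressed as a percentage.

    Assumes y_true contains no zeros, which holds for sale prices.
    """
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Toy example with hypothetical sale prices (prints a MAPE of about 8.7%).
actual = np.array([800_000.0, 1_200_000.0, 650_000.0])
predicted = np.array([880_000.0, 1_100_000.0, 600_000.0])
print(f"MAPE: {mape(actual, predicted):.1f}%")
```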
The project is considered a success if it achieves a MAPE of 7.5%, half of the best error from the 2021 project.
Data Collection Process

The data collection pipeline is iterative: house data is fetched through an API request, saved into the database, and periodically updated with more detailed house information and amenities. The pipeline is fully automated and runs every 12 hours so that the database always serves up-to-date, reliable information. Automation is handled with GitHub Actions Docker container actions, which run on a Google Cloud virtual machine to ensure the hardware is consistent and the database can be updated reliably. MongoDB was chosen as the database for its fast data access on read/write operations.
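A minimal sketch of one refresh cycle is shown below, assuming a hypothetical listings endpoint, collection names, and field names; the project's actual API and schema may differ:

```python
import requests
from pymongo import MongoClient

LISTINGS_URL = "https://api.example.com/gta/listings"  # hypothetical endpoint

def refresh_listings(mongo_uri: str = "mongodb://localhost:27017") -> None:
    """Fetch the latest GTA listings and upsert them into MongoDB."""
    collection = MongoClient(mongo_uri)["realvalue"]["houses"]

    response = requests.get(LISTINGS_URL, timeout=30)
    response.raise_for_status()

    for listing in response.json():
        # Upsert keyed on the listing ID so re-runs update rather than duplicate.
        collection.update_one(
            {"listing_id": listing["listing_id"]},
            {"$set": listing},
            upsert=True,
        )

if __name__ == "__main__":
    refresh_listings()
```

In the project itself, a job along these lines is triggered on the 12-hour schedule by the GitHub Actions workflow rather than run by hand.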
Dataset

The final dataset comprises over 30,000 homes across Toronto. Each entry contains numerical and categorical attributes such as the number of bedrooms, the number of bathrooms, and the type of house, as well as over 40 amenity features, including nearby public transit, restaurants, and schools. This is a major improvement over last year's project, which consisted of only 157 homes with very limited attributes. However, 65% of the dataset had no corresponding images, which led to a different training approach.
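To make the structure concrete, a single record might look roughly like the sketch below; the field names and values are illustrative, not the project's exact schema:

```python
# Illustrative example of one house record; field names are hypothetical.
house = {
    "listing_id": "C1234567",
    "sold_price": 1_150_000,
    "bedrooms": 3,
    "bathrooms": 2,
    "house_type": "detached",
    # Amenity features (40+ in the real dataset), e.g. counts near the home.
    "nearby_transit_stops": 4,
    "nearby_restaurants": 12,
    "nearby_schools": 2,
    # Present for roughly 35% of homes.
    "image_urls": ["https://example.com/photos/C1234567_front.jpg"],
}
```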
Training was conducted in two parts:
Training without images involved testing 8 different regressor models (a sketch of this comparison appears after this list).
Training with images involved testing the best of the 8 regressors (a regression tree ensemble) against a CNN, EfficientNetB0, which was chosen for its high accuracy on image classification problems (a sketch of a combined image-and-tabular model also appears below).
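As a rough illustration of the first part, a comparison like this can be run with scikit-learn's cross-validation and its built-in MAPE scorer. The models below are plausible candidates rather than the project's exact set of eight, the CSV export and column names are hypothetical, and categorical attributes are assumed to be already one-hot encoded:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical tabular export of the dataset, with categoricals already encoded.
df = pd.read_csv("gta_houses.csv")
X = df.drop(columns=["sold_price"])
y = df["sold_price"]

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in candidates.items():
    # scikit-learn returns negative MAPE (as a fraction), so flip the sign
    # and convert to a percentage for reporting.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_percentage_error"
    )
    print(f"{name}: MAPE = {-scores.mean() * 100:.1f}%")
```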
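For the second part, one common way to combine EfficientNetB0 image features with the tabular attributes is a two-branch network that concatenates both representations before a regression head. The Keras sketch below uses assumed input sizes and layer widths and is not necessarily the project's exact architecture:

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import EfficientNetB0

def build_image_tabular_model(num_tabular_features: int) -> Model:
    # Image branch: EfficientNetB0 backbone with global average pooling.
    image_input = layers.Input(shape=(224, 224, 3), name="image")
    backbone = EfficientNetB0(include_top=False, weights="imagenet", pooling="avg")
    image_features = backbone(image_input)

    # Tabular branch: the same numeric features used by the regressors.
    tabular_input = layers.Input(shape=(num_tabular_features,), name="tabular")
    tabular_features = layers.Dense(64, activation="relu")(tabular_input)

    # Fuse both branches and regress a single sale price.
    merged = layers.Concatenate()([image_features, tabular_features])
    hidden = layers.Dense(128, activation="relu")(merged)
    price = layers.Dense(1, name="price")(hidden)

    model = Model(inputs=[image_input, tabular_input], outputs=price)
    model.compile(optimizer="adam", loss="mean_absolute_percentage_error")
    return model
```

Using MAPE as the training loss keeps the optimization objective aligned with the evaluation metric, though mean absolute error on log prices is another common choice.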
Results were reported for two configurations: training without images and training with images.
We are continuing this project as RealTime-2 in 2022, where we aim to beat the state-of-the-art MAPE of 7.5% (RealTime MAPE: 9%). We will focus on making request allocations more efficient so that more homes can be served, and on drawing on further data sources and integrating more APIs to build a wider dataset while maintaining its large size. For more details, please visit https://utmist.gitlab.io/projects/realtime2/.