Predicting real estate prices is like any other geospatial problem: all you need is data (see Zillow, for example). If locations such as cities are correlated with environmental variables in their house prices or price increases, then a model can be developed. Here I address the question: what environmental variables predict the increase in house prices in metro areas of the US? Applying WhyWhere (WW) to a topic other than predicting species produces some interesting results: the best predictor of areas with high or low price increases is precipitation (not temperature, as is usually thought), and the response shows a strongly bimodal distribution. This example illustrates the generality of the WW approach to environmental niche modeling (ENM).
WW can search almost 1000 available environmental data images such as the precipitation variable above, create a model based on the best one to three and display the predicted probabilities on maps colored according to the adjacent legend.
Accuracy is determined in each category by the relative proportion of environmental cells in that category, i.e. each environmental category has a frequency based on the number of grid cells with that value. If the points to be predicted occurred at random, they would have the same distribution of environmental values as the variable as a whole. But if the proportion in a category is greater or less than expected, this is a basis for predicting presence or absence. See the 1D histograms in Figure 1 for an example. The variables selected are those where the frequency of points differs most from that of the environment, giving categories with highly differing proportions overall, and hence more accurate predictions.
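The idea of comparing point frequencies with background frequencies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WW's actual code: the function names, the toy grid, and in particular the accuracy score (the mean of the presence hit rate and the background rejection rate) are my own choices for the example.

```python
from collections import Counter


def category_proportions(values):
    """Return {category: proportion of all values} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def predict_categories(background, presences):
    """Predict presence in a category when points are over-represented there.

    A category is predicted 'present' if the proportion of points falling in it
    exceeds the proportion of background cells in it (assumed rule).
    """
    bg = category_proportions(background)
    pts = category_proportions(presences)
    return {category: pts.get(category, 0.0) > p_bg for category, p_bg in bg.items()}


def sketch_accuracy(background, presences):
    """Assumed score: mean of presence hit rate and background rejection rate."""
    predicted = predict_categories(background, presences)
    hit_rate = sum(predicted.get(v, False) for v in presences) / len(presences)
    reject_rate = sum(not predicted[v] for v in background) / len(background)
    return (hit_rate + reject_rate) / 2


# Toy grid of 10 cells in three made-up categories; "ocean" contains no points.
background = ["ocean"] * 4 + ["dry"] * 3 + ["wet"] * 3
presences = ["dry", "dry", "wet", "dry"]

predicted = predict_categories(background, presences)
accuracy = sketch_accuracy(background, presences)
print(predicted)
print(accuracy)
```

In the toy grid only the "dry" category is over-represented among the points, so it is the only one predicted as presence. A variable whose categories separate points from background more sharply would score higher under this assumed measure.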
Although an analysis of real estate using climate variables and other variables from species modeling must be somewhat sub-optimal, this experiment is also an opportunity to illustrate the effects of different modeling choices: masking, presence-only (P) analysis and presence/absence (PA) analysis.
Preparation of data
The National Association of Realtors publishes housing statistics for metropolitan areas here. The data file was an Excel spreadsheet of the form:
Metropolitan Area 2002 2003 2004 2004:III 2004:IV 2005:I 2005:II r 2005:III p %Chya
"Allentown-Bethlehem-Easton, PA-NJ", 161.1, 184.7, 207.3, 222.6, 210.5, 214.8, 242.7, N/A, N/A
To these data we have to add decimal coordinates. For the P predictions the points take the form:
-117.425 47.65888889
-74.42333333 39.36416667
or in the case of PA prediction:
-117.425 47.65888889 0
-74.42333333 39.36416667 1
I used a free server at http://geocoder.us to obtain these coordinates. This server returns the latitude and longitude of any US address. Here is an example of a query for Phoenix, AZ, where house prices increased 55% last year:
POST: http://geocoder.us/service/csv/geocode?city=Phoenix&state=AZ
RETURNS: -112.0733333, 33.44833333, Maricopa, Phoenix, AZ
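Turning a returned row into a line for WW is a small text-handling step, sketched below. The sketch makes no network call and assumes the field order shown in the example response (longitude, then latitude, then place names); the function names `parse_geocoder_row` and `point_line` are hypothetical helpers for illustration.

```python
def parse_geocoder_row(row):
    """Return (lon, lat) as the original text fields of one returned row.

    Assumes the field order of the example response: longitude first,
    then latitude, then place names.
    """
    fields = [field.strip() for field in row.split(",")]
    lon, lat = fields[0], fields[1]
    # Check that both fields are numeric and in range, without reformatting them.
    if not -180.0 <= float(lon) <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    if not -90.0 <= float(lat) <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    return lon, lat


def point_line(lon, lat, label=None):
    """Format a P line (no label) or a PA line (label 1 or 0)."""
    if label is None:
        return f"{lon} {lat}"
    return f"{lon} {lat} {int(label)}"


row = "-112.0733333, 33.44833333, Maricopa, Phoenix, AZ"
lon, lat = parse_geocoder_row(row)
p_line = point_line(lon, lat)            # presence-only form
pa_line = point_line(lon, lat, label=1)  # presence/absence form
print(p_line)
print(pa_line)
```

Keeping the coordinates as text avoids any change in the number of decimal places between the geocoder response and the points pasted into WW.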
The resulting spreadsheet is here. Using these data I extracted the coordinates of metro areas with %Chya > 20 and pasted 24 of those points into WW at the Get_Points stage. The GIS was All_Terrestrial, consisting of 205 climatic, hydrographic, landscape and other assorted variables at a resolution of 0.1 degrees. All the data were used in the prediction because, with this small number of points, subsetting the points would have produced substantial variation in the variables selected. Had I been interested in statistical skill I would have repeated the analysis a number of times, estimating accuracy on a ‘held back’ set of points.
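The selection by %Chya > 20 can be sketched as follows. This is a minimal illustration, not the procedure actually used (the points were extracted from the spreadsheet by hand): the helper names are hypothetical, skipping rows marked N/A is my assumption, and apart from the Phoenix figure of 55% the percentage values and the last pair of coordinates are made-up placeholders.

```python
def parse_change(text):
    """Return the percentage change as a float, or None when it is 'N/A' or blank."""
    text = text.strip()
    if not text or text.upper() == "N/A":
        return None
    return float(text)


def label_points(records, threshold=20.0):
    """Build P and PA point lines from (lon, lat, change_text) records.

    Rows without a published change are skipped (assumption): they count
    neither as presences nor as absences.
    """
    p_lines, pa_lines = [], []
    for lon, lat, change_text in records:
        change = parse_change(change_text)
        if change is None:
            continue
        present = change > threshold
        pa_lines.append(f"{lon} {lat} {int(present)}")
        if present:
            p_lines.append(f"{lon} {lat}")
    return p_lines, pa_lines


records = [
    ("-112.0733333", "33.44833333", "55"),   # Phoenix, AZ (55% from the text)
    ("-117.425", "47.65888889", "12.0"),     # change value is a placeholder
    ("-74.42333333", "39.36416667", "25.0"), # change value is a placeholder
    ("-75.0", "40.0", "N/A"),                # placeholder row with no figure
]

p_lines, pa_lines = label_points(records)
print("\n".join(p_lines))
print("\n".join(pa_lines))
```

The P list holds only the metro areas above the threshold, while the PA list holds every area with a published figure, labelled 1 or 0.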
Results
P data and no mask
The results for P data and no mask were as follows:
Environmental Data for .22.90 from All_Terrestrial
22. lcprc02 Leemans and Cramer February Precipitation (mm/month)
Range 0 to 652 millimeters/month
90. lwmpr06 Legates & Willmott June Measured Precipitation (mm/month)
Range 0 to 1129 millimeters/month
Accuracy 0.921
Figure 1. Predictions of price increases >20% using P data and no mask (1D|Hist|2D|Hist).
The first point is that because the ocean is not masked, the model has to predict these areas as absences. The environmental category for ocean can be seen in the histogram of environmental categories below. The precipitation variable lcprc02 has a constant value across the ocean, allowing the ocean to be treated as a single environmental category in which there are no house price points.
P data
Normally in a terrestrial analysis the oceans are masked out and play no part in the analysis. This is because, firstly, we are not interested in them, and secondly, if they are included, the first variable selected is often forced to distinguish between ocean and land. When I ran WW with the ocean mask the result was:
Environmental Data for .185.73 from All_Terrestrial
185. alt Altitude
Range to percentage
73. lwcsd02 Legates & Willmott February Corrected Precipitation (std. dev.)
Range 0 to 176 millimeters/month
Accuracy 0.85
Figure 2. Predicted price increases >20% using P only and ocean mask (1D|Hist|2D|Hist).
The first point is that masking the ocean gives different variables. Altitude is the first variable selected, possibly because metro areas with high growth are at low altitude. However, it could also be because most metro areas are at low altitude. We are primarily interested in variables that distinguish between the metro areas, so we do a PA analysis.
PA data
Here we do a PA analysis. For this we have to paste in coordinates of all metro areas with a 1 or 0 depending on P or A. We rerun the analysis and obtain:
Environmental Data for .195.84 from All_Terrestrial
195. bio_18 Precipitation of Warmest Quarter
Range to percentage
84. lwmpr00 Legates & Willmott Annual Measured Precipitation (mm/year)
Range 0 to 6434 millimeters/year
Accuracy 0.818
Figure 3. Predicted price increases using PA data (1D|Hist|2D|Hist).
The interesting point is that precipitation is found to be the most accurate and is bimodal in form. High price increases occurred in areas of either high summer or low summer precipitation. To investigate a little further and verify the results I ran WW on a smaller data set of climate variables called Climate_Ann_Ave consisting of annual temperature, precipitation and standard deviations. The accuracies and maps are as follows:
Environmental Data for .0 from Climate_Ann_Av
0. lwcpr00 Legates & Willmott Annual Corrected Precipitation (mm/year)
Range 0 to 6626 millimeters/year
Accuracy 0.787
Table 1. Accuracies for each single variable from Climate_Ann_Av.
| Variable | Code | Accuracy |
|---|---|---|
| Legates & Willmott Annual Corrected Precipitation (mm/year) | lwcpr00.pgm | 0.796 |
| Legates & Willmott Annual Corrected Precipitation (std. dev.) | lwcsd00.pgm | 0.672 |
| Legates & Willmott Annual Standard Error (mm/year) | lwerr00.pgm | 0.745 |
| Legates & Willmott Annual Measured Precipitation (mm/year) | lwmpr00.pgm | 0.782 |
| Legates & Willmott Annual Measured Precipitation (std. dev.) | lwmsd00.pgm | 0.688 |
| Legates & Willmott Annual Temperature (0.1C) | lwtmp00.pgm | 0.753 |
| Legates & Willmott Annual Temperature (std. dev.) | lwtsd00.pgm | 0.652 |
Annual precipitation is once again a better predictor than temperature albeit with lower accuracy (0.79 vs. 0.85) than the seasonal or monthly precipitation variables.
Figure 4. Predicted price increases greater than 20% using Climate_Ann_Ave on P data (1D|Hist).
As a further test I looked at the prediction of those 78 metro areas that had an increase of less than 10% per year. The results were as follows:
Environmental Data for .3 from Climate_Ann_Av
3. lwmpr00 Legates & Willmott Annual Measured Precipitation (mm/year)
Range 0 to 6434 millimeters/year
Accuracy 0.732
Figure 5. Predicted price increases less than 10% (1D|Hist).
As expected, precipitation is again the most accurate predictor, and the response is no longer bimodal in form. Low growth in prices occurred in areas of moderate precipitation, with high variance in areas of high precipitation. This is a response pattern more familiar from predicting species.
Discussion
This analysis illustrates a number of points about ENM. Firstly, it is clear that precipitation is a strong predictor of recent house price increases from a statistical point of view. This result differs from most commentaries, which regard the real estate expansion as related to temperature because of growth in Florida and the SW. The increase in markets in the NE and NW contradicts this view. Also, many high temperature regions are not increasing.
Secondly, the histograms of response show that growth falls into two groups, high and low rainfall. That a bimodal variable is identified as the best predictor shows the power of the method. If we had tried to fit a unimodal curve, as is frequently done with species data, we would not have identified a bimodal response like this, which requires a higher-order curve (a polynomial with two interior peaks must be at least quartic). Obtaining a bimodal response shows that a more general approach is required when ENM is applied to general problems. House price increases may be bimodal either because they are more complex, i.e. prices may have advanced in two different regions for two different reasons, or simply because they are less well modeled by the available variables. It could also be that the metro areas with intermediate growth in the mid-continent represent an actual stable niche, and the areas with precipitation extremes have more forces creating unstable prices, such as building space limitations or high building material prices.
The next question is what the future holds for prices or, more to the point, what is a rational approach to predicting prices? One approach is to assume next year will be the same as last year, whereby the faster appreciating areas will continue to appreciate at a faster rate. Another approach would suggest that the market will ‘fill in’ in those areas that neighbour areas of high appreciation. This would suggest that under-appreciated locations in the high rainfall areas of the SE, which show on Figure 4 as high variance (red and blue checkerboard effect), will be prime for appreciation.

3 responses so far ↓
1 ENM » Surprising finding #2 // Apr 19, 2006 at 7:14 pm
[...] In an earlier post on the spatial analysis of increasing house prices in the US, I used the small set of annual climate variables (7) and found that precipitation rather than temperature was a better predictor of metropolitan areas with increases greater than 20% in median price in 2005. Here I have run the analysis again using the new version of WhyWhere and the entire set of available terrestrial variables (All_Terrestrial). This time the best variable was etopo-terr (accuracy = 0.80), a raw elevation variable. The response graph shows the highest appreciation is in the category of lowest altitudes (tallest red column with frequency of background in blue). I think this is a more sensible result than that achieved using climate variables, as appreciation is well known to have been in coastal areas. The result is surprising as it illustrates that the WhyWhere approach generalizes to prediction of things other than biological species. An analysis on a non-biological set of data points will give a reasonable explanation when data-mining a large dataset of environmental variables. By comparison, use of the standard set of climate variables would give non-intuitive, if not spurious, results. [...]
An excellent read. I’ve looked over the National Association of Realtors site as well, and there are some fundamentally valid points there too. What throws it all into a tailspin is that markets are ultimately raised or dropped by the population. No math model can completely predict what event will turn a place into the “place to be”, but this sure helps when that little part of the equation is left out. Good stuff!