Monday, December 11, 2017

Predicting House Prices

Introduction

We all know that when it comes to housing prices, the most important thing is location, location, location.  But what else matters?  A Kaggle competition asks that very question (see here for details).  Using data from home sales in Ames, Iowa, the competition asks us to predict the final sales price.  Variables to use in that prediction include features such as year built, square footage, number of bathrooms, number of bedrooms, central heating, and other potentially important factors.

Analysis

So what does matter?  I entered the competition as part of a final project for a class in my data science Master's program.  My full process and results as part of that class are located here for you to take a look at:
But here is a quick summary if you want the short version of the results:
  • My best model had an R^2 value of 0.935.  While this sounds great, it barely got me into the top 50% of the rankings.  Kaggle competitions are hard.
  • The most important variables (according to my models) in predicting sales price are:
    • how much square footage does the house have? More is better.
    • what kind of roof does it have? Certain kinds (membrane, metal) are better than others (Composite Shingle, Wood Shake).
    • what is the overall condition of the house? A higher ranking is better.
    • zoning (commercial zoning lowers the price)
    • overall quality (higher is better)
    • basement square footage (more is better)
    • lot area (more is better)
If I had more time to complete my analysis, what else would I do?  At the very least, I would do more feature engineering, and I would break the given training data into training and testing sets so that I could check how well my model generalizes and correct overfitting before submitting my predictions to Kaggle.  I would also normalize the variables so that their coefficients could be compared directly, which would show which variables contribute the most to the overall price, and not just which ones correlate the most.
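
As a rough illustration of those two steps (not the code from my project), here is a minimal sketch in plain Python.  The helper names (split_rows, fit_scaler, standardize) and the lot area numbers are made up for the example.  The one idea worth noting is that the mean and standard deviation are computed from the training portion only and then reused on the held-out portion, so nothing about the test data leaks into the model.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Return (mean, std) computed from the training values only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Rescale values to z-scores using a previously fitted mean and std."""
    return [(v - mean) / std for v in values]


# Made-up lot areas in square feet, for illustration only.
lot_areas = [5000, 7500, 8000, 9000, 10000, 11000, 12000, 13500, 15000, 20000]

train, test = split_rows(lot_areas, test_fraction=0.2, seed=42)
mean, std = fit_scaler(train)           # fitted on the training rows only
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)   # reuses the training mean and std
```

Once every variable is on the same z-score scale, a coefficient reads as "price change per one standard deviation of that variable", which is what makes coefficients comparable across variables.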

But with limited time and an open-ended problem to "solve", one must draw a line somewhere and say "good enough" to avoid diminishing returns on time and effort.  And since my solution was "good enough" to help me really understand the nature of the problem and to complete my final project, I left my model as is after breaking the 50% mark.

Predicting My House

Most of you will know that some websites already provide an estimate of how much your house is worth based on factors such as those considered in the Kaggle competition (e.g., Zillow's Zestimate).  In August 2010, my house was listed for $300K by the previous owner.  As I live in the Seattle, WA housing market, prices are definitely NOT typical compared to other places in the US (and have only gotten worse since 2010, or better depending on your point of view).

What would my model predict the SalePrice of my house to be?  My house is NOT in Ames, Iowa, so the most important factor in determining house price (location, location, location) is different and is not being accounted for.  Still, I was curious, so I filled in all of the variables for my house (e.g., square footage) to see what my model predicted my house would have been sold for if it had been in Ames, Iowa in 2010.

After filling in the data for my house, my model predicted a value of $168,455.  When I manually looked at the training data for similar houses to mine, the average price was comparable ($160K-$180K).  So my model seems to be predicting close to what I would expect for my house IF it were in Ames.  I therefore take the roughly $130K difference between that prediction and the $300K listing price to be the price due to a difference in location, location, location at that time.  I'm sure that this difference is now much greater.

Conclusion

None of what my analysis revealed is really surprising.  The factors that are important are things I would consider important when looking at a house.  Perhaps the Kaggle competition is best understood as being less about understanding what is important and more about understanding and quantifying how important these things are.  That is, which factors contribute the most value to the price of a house?

While I would need to do more work to more accurately assess the magnitude of impact that each variable has in determining the house price, a quick multiplication of the coefficient for a variable in the model by the average value of the variable can tell us roughly what the average impact of each variable is in contributing towards the final sales price.
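
To make that multiplication concrete, here is a minimal sketch in plain Python.  It assumes a plain linear model on the untransformed sale price (an intercept plus coefficient times value for each variable).  The function name average_contributions and every number in it are hypothetical, chosen only so the arithmetic is easy to follow; they are not the coefficients from my model.

```python
def average_contributions(intercept, coefficients, averages):
    """Return (predicted price at average values, percentage share per term).

    Each variable's contribution is its coefficient times its average value;
    the intercept contributes itself.
    """
    contributions = {"Intercept": intercept}
    for name, coef in coefficients.items():
        contributions[name] = coef * averages[name]
    total = sum(contributions.values())
    shares = {name: 100.0 * value / total
              for name, value in contributions.items()}
    return total, shares


# Hypothetical coefficients (dollars per unit) and average values.
coefficients = {"LotArea": 1.0, "FirstFloorSF": 20.0, "YearsOld": -250.0}
averages = {"LotArea": 10000.0, "FirstFloorSF": 1000.0, "YearsOld": 40.0}

total, shares = average_contributions(100000.0, coefficients, averages)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

A variable with a negative coefficient, such as the age of the house here, shows up as a negative share, and all of the shares together add up to 100%.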

The SalePrice for a house using average values and the coefficients in the model is about $160K.  Here are some of the most important variables and their percentage contributions towards that price:
  • Intercept: 59% (starting value for a house)
  • Roof Material - Composite Shingle: 23%
  • LotArea: 7.3%
  • 1st floor square footage: 1.9%
  • YearsOld: -4.0%
Composite shingle is the most common roof material, so this isn't really that interesting (one can think of this as generally adding to the intercept except in a few cases).  Unsurprisingly, lot area and square footage are important contributions towards adding value to the sale price, while years old removes value from the sale price. 

Again, not very surprising, although the fact that lot area appears to matter more than square footage in determining price is perhaps a little surprising.  But if one remembers that location, location, location is what matters most, then it makes sense that how much you own of that location, location, location may be more important than what is on that location, location, location.

To conclude, I hope that the above summary gives you some insight into what makes a house valuable and why that might be the case.  I believe my analysis matches our intuitions about what determines a sales price for a house, so if nothing else, it confirms what we already intuitively know, and gives us a way to quantify those intuitions.

Thanks!
