Wednesday, June 20, 2018

Cross Validation: An Experiment

The purpose of a predictive model is NOT to predict existing data well.  Predicting existing data well is the means, not the end; the end is to predict future, unseen data well.  Typically, to avoid building a model that fits the existing data beautifully but performs poorly on future data (i.e., overfitting), the data set is split into a training set and a test set: the model is trained on the training set and then evaluated against the test set to see how well it performs.  But what if you don't have enough data to set aside a separate test set?  And how do you get a sense of the variation in your model's performance?  You may simply have gotten a lucky test set...

Cross validation provides a way to test one's model while still using all of the available data to train it, and it gives some confidence that the result isn't just down to a lucky test split.  In short, a model is trained on a subset of the data and then tested against the held-out subset, and this can be repeated many times.  With K-fold cross validation, 10 "folds" are typically created: a model is trained on 9 of the folds and tested against the held-out fold, and the process then repeats with the next combination, for a total of 10 different models.  The results are averaged so one can see how well one's model does "on average".
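To make the fold mechanics concrete, here is a quick illustration using caret's createFolds function on R's built-in iris data (a toy example of my own, not the data used later in this post).  It returns K lists of row indices, each of which serves once as the hold-out set.

library(caret)

# createFolds returns a list of K index vectors; each vector marks the rows
# held out for one fold, stratified on the outcome.
folds <- createFolds(iris$Species, k = 10)

length(folds)          # 10 folds
sapply(folds, length)  # roughly equal fold sizes (about 15 rows each)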

While 5 or 10 folds is the usual standard, I wanted to see what would happen across a variety of values for K.  Using the Boston data set from the MASS package, I performed the cross validation with the caret package.
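Here is a minimal sketch of that setup.  The target column name (crim_high) and the decision to use all remaining variables as predictors are my assumptions for illustration, not details given above.

library(MASS)    # Boston housing data
library(caret)   # trainControl()/train() for cross validation

data(Boston)

# Binary target: is per-capita crime (crim) above its median?
Boston$crim_high <- factor(ifelse(Boston$crim > median(Boston$crim),
                                  "Above", "Below"),
                           levels = c("Below", "Above"))

# Drop the raw crime rate so it cannot leak into the predictors
Boston$crim <- NULL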

To start, I created a baseline glm model to predict a target variable (whether crime per capita is above or below the median crime rate).  This gave me a model with an accuracy of 0.9249 with a 95% CI of (0.8971, 0.9471).  Next, I found the average accuracy for each model trained using 2 to 250 folds.  The 249 results had an average accuracy of 0.9117 with a 95% CI of (0.9056, 0.9177).
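A hedged sketch of how those two numbers might be produced, reusing the crim_high column from the data-prep sketch above; the object names and the 0.5 classification cutoff are mine, not necessarily what was actually used.

# Baseline: fit a logistic regression on the full data and score it on the
# same data, which gives the in-sample accuracy and its 95% CI.
baseline  <- glm(crim_high ~ ., data = Boston, family = binomial)
base_prob <- predict(baseline, type = "response")   # P(crim_high == "Above")
base_pred <- factor(ifelse(base_prob > 0.5, "Above", "Below"),
                    levels = levels(Boston$crim_high))
confusionMatrix(base_pred, Boston$crim_high)

# Cross-validated accuracy for every K from 2 to 250 folds.  With a two-class
# factor outcome and method = "glm", caret fits the same logistic regression
# inside each fold.  (This loop can take a while.)
ks <- 2:250
cv_accuracy <- sapply(ks, function(k) {
  ctrl <- trainControl(method = "cv", number = k)
  fit  <- train(crim_high ~ ., data = Boston, method = "glm", trControl = ctrl)
  fit$results$Accuracy   # already the average accuracy across the k folds
})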

From the chart below, one can see that lower values of K give a slightly lower accuracy (there is a vertical line at K = 10).  Beyond 50 folds the average seems fairly settled around 0.9125, with some variation between 0.900 and 0.925.
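The original chart isn't reproduced here, but a plot along these lines (reusing ks and cv_accuracy from the sketch above) shows what the description refers to.

# Average CV accuracy versus number of folds, with a dashed reference
# line at the conventional K = 10.
plot(ks, cv_accuracy, type = "l",
     xlab = "Number of folds (K)",
     ylab = "Average cross-validated accuracy")
abline(v = 10, lty = 2)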



So what does this mean?  It suggests that our baseline model was a little too optimistic at an accuracy of 0.925, since that value sits at the upper bound of how the cross-validated models performed.  However, the result is still within the CI provided by the baseline model, which is reassuring.  In any case, we now have a more reliable assessment of how the model will perform on future, unseen data.  As expected, it is lower than the training accuracy, but not by so much as to make the model unusable for future predictions.

We also see that our estimate of accuracy on future data is affected by the number of folds we choose.  Too few folds means we have not tested against enough hold-out sets to be confident the model generalizes well, so the estimate may not be trustworthy.  On the other hand, too many folds may produce wildly different models because each small sample differs (and it may take a lot of computing power).  While I'd need more experience to generalize, I don't see a reason not to use the standard K = 10 as a good middle ground.  In my experiment, increasing the folds above 50 barely moves the accuracy at all.  And when the results are plotted with a y-axis from 0 to 1, there is no discernible difference.
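For reference, that full-scale view could be produced like this (again reusing objects from the earlier sketches):

# Same plot, but on the full 0-to-1 accuracy scale; at this resolution the
# curve looks essentially flat.
plot(ks, cv_accuracy, type = "l", ylim = c(0, 1),
     xlab = "Number of folds (K)",
     ylab = "Average cross-validated accuracy")
abline(v = 10, lty = 2)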


To summarize, if you need a way to test your model's ability to generalize to future/test data and would like a sense of how well it does so, consider cross validation for a more robust, stable, and reliable estimate of your model's performance.

Models aside, testing one's hypotheses, beliefs, or claims about the world (i.e., the results from one's model) to see how they perform in the world of facts and experience (i.e., future data) is just good epistemic practice.  If we want our beliefs to be justified and well grounded, shouldn't we want the same for our models?  So just as you should "cross validate" your beliefs, make sure to justify your models as well.  Cross validation is an effective means of doing so.


References:
https://www.r-bloggers.com/cross-validation-for-predictive-analytics-using-r/
