Welcome! I intend this to be an ongoing project of predicting NFL game outcomes, point spreads, and final scores. My hope is that my models will improve over time as they become more sophisticated and use better data. I will try to regularly publish predictions to keep myself accountable, and will make observations along the way about what is working and what isn't. See below for latest updates. Enjoy!
Previous 2020 Predictions:
_________________________________________________________________________________
Recap:
11-4. The models certainly did better than in previous weeks. The Steelers-Titans game was postponed due to COVID-19 illnesses and will be played in week 7 after some schedule reshuffling. All other games were played, although the Patriots had to play without their starting QB. The logistic model expected 9.38 correct picks while the random forest model expected 8.78, so the actual 11 correct picks beat both expectations.
Of the oddities and bold predictions, 3 of the 4 games went my way. Only the Bengals won against my models' predictions. Upsets included the 49ers' loss to the Eagles and the stunning victory of the Browns over the Cowboys.
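For readers wondering where an "expected value" like 9.38 comes from: one common way to get it, and this is an assumption about the calculation rather than something stated above, is to add up the probability each model assigns to the team it picked. A minimal sketch with made-up probabilities (the function name and numbers are hypothetical):

```python
# Hypothetical win probabilities for the team picked in each game.
# These numbers are invented for illustration; they are not the models' real outputs.
picked_team_probabilities = [0.81, 0.66, 0.58, 0.52, 0.73]


def expected_correct_picks(probabilities):
    """Expected number of correct picks: the sum of per-game win probabilities."""
    return sum(probabilities)


expected = expected_correct_picks(picked_team_probabilities)
print(f"Expected correct picks: {expected:.2f} of {len(picked_team_probabilities)}")
```

Under this reading, a week of mostly coin-flip games produces a low expected value even if every pick turns out right, which is why comparing the actual record to the expectation is more informative than the record alone.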
On to week 5!
_________________________________________________________________________________
Week 4:
Week 4 brings a new model: a random forest. I chose a random forest because the outcome we are trying to predict is categorical ("win" or "lose"), which makes this a classification problem. These models are easy to train (although not as fast as the logistic regression model) and are known to be powerful and accurate when the forest contains many decision trees. They also rank features by importance as part of training, which makes it easier to see which variables matter most.
Perhaps more importantly, a logistic model assumes that the variables are linearly related to the log-odds of the target, and I suspect that this is not true. Said differently, a logistic model roughly assumes that variable x + variable y + variable z = target A. However, football may not work that way. It may not be the case that having a certain winratio + years since the playoffs - starters injured = "win".
Instead, the relationship may be more like: "if starting QB is injured -> 'lose'; else, if win ratio is greater than 0.9 -> 'win'; else, if years since playoffs is greater than 5 -> 'lose'; ..." That is, various combinations of variables may lead to a win or loss, but the specific combination matters, and changing any value in that combination even slightly can lead to a different result. In short, we would have a non-linear relationship between the variables and the target.
A non-linear relationship is exactly the sort of relationship that a random forest model can capture. Consequently, it may perform much better than a logistic regression model. I intend to run both models in parallel to see how each does, with the hope that the random forest model is in fact better and can be used as my production model in future weeks and years.
The random forest model performs largely the same on training and test data as the logistic regression model, with 72% accuracy. Will it perform better than logistic regression in practice? I hope so.
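To make the contrast concrete, here is a small sketch of the two ways of thinking. The weights in the additive function are hypothetical and chosen only for illustration, and the final "win" default in the rule chain is an assumption, since the chain above is left open-ended.

```python
import math


def additive_win_probability(win_ratio, years_since_playoffs, starters_injured):
    """Logistic-style view: every variable adds to or subtracts from one score."""
    # Hypothetical weights, not fitted to any data.
    score = -0.5 + 3.0 * win_ratio - 0.2 * years_since_playoffs - 0.4 * starters_injured
    return 1.0 / (1.0 + math.exp(-score))


def rule_based_pick(qb_injured, win_ratio, years_since_playoffs):
    """Tree-style view: follow the if/else chain described in the text."""
    if qb_injured:
        return "lose"
    if win_ratio > 0.9:
        return "win"
    if years_since_playoffs > 5:
        return "lose"
    return "win"  # assumed default; the text does not spell out the rest of the chain


# A small change in win ratio nudges the additive probability but flips the rule.
for ratio in (0.89, 0.91):
    p = additive_win_probability(ratio, years_since_playoffs=6, starters_injured=0)
    pick = rule_based_pick(qb_injured=False, win_ratio=ratio, years_since_playoffs=6)
    print(f"win_ratio={ratio}: additive probability={p:.3f}, rule-based pick={pick}")
```

The additive function moves smoothly as any one input changes, while the rule chain can jump from "lose" to "win" at a threshold. A decision tree learns chains of this second kind, and a random forest averages many of them.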
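Running the two models side by side might look roughly like the sketch below. It assumes scikit-learn and uses synthetic data labeled by the if/else rule from earlier in this section, not real NFL data, so the setup favors the forest by construction and the numbers say nothing about the 72% figure above. The final "win" branch of the rule is also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_games = 2000

# Synthetic stand-ins for three of the variables discussed above (not real NFL data).
win_ratio = rng.uniform(0.0, 1.0, n_games)
years_since_playoffs = rng.integers(0, 11, n_games)  # whole years, 0 through 10
qb_injured = rng.integers(0, 2, n_games)             # 0 = healthy, 1 = injured
X = np.column_stack([win_ratio, years_since_playoffs, qb_injured])

# Label each game with the if/else rule: 1 = win, 0 = lose.
y = ((qb_injured == 0) & ((win_ratio > 0.9) | (years_since_playoffs <= 5))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit both models on the same training split.
logistic = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Compare accuracy on the same held-out games.
logistic_accuracy = logistic.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print(f"Logistic regression test accuracy: {logistic_accuracy:.3f}")
print(f"Random forest test accuracy:       {forest_accuracy:.3f}")
```

Scoring both models on the same held-out split keeps the comparison fair. On real game data the gap can easily vanish, as it did here.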
Which variables matter most in the random forest model? In order:
- Rolling winratio over past 16 games
- Opponent's years since the playoffs
- Team's average point difference with opponents
- Rolling winratio over past 32 games
- Opponent prior year points for
- Average team score
- Average opponent rushing yards
- Team's years since playoffs
- Average opponent's total yards
- Average opponent's passing yards
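A ranking like the one above can be read off a fitted forest. The sketch below assumes scikit-learn's impurity-based `feature_importances_`; the post does not say which importance measure was actually used, and the column names and data here are hypothetical and synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_games = 1000

# Hypothetical column names, loosely modeled on the variables listed above.
feature_names = ["rolling_winratio_16", "opp_years_since_playoffs", "avg_point_diff"]

X = rng.normal(size=(n_games, len(feature_names)))
# Synthetic outcome driven mostly by the first column, so it should rank first.
noise = rng.normal(scale=0.5, size=n_games)
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so each value can be read as a share of the forest's total split quality. Correlated variables (such as the 16-game and 32-game win ratios) tend to split that share between them.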
I notice that although there are some important variables that overlap with the prior logistic regression model (e.g., winratio, years since the playoffs), there are more traditional stats in this model as well (e.g., passing yards) and it seems to be less redundant. Hopefully that means it can provide more nuance and better predictions overall.
I also retrained the logistic regression model using automated feature selection and, interestingly, it now includes more traditional stats as well. Hopefully it will also do better than before.
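The post does not say which automated feature selection method was used. As one possible illustration, and purely as an assumption, recursive feature elimination (RFE) in scikit-learn repeatedly drops the weakest feature until a target count remains. The data below is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 candidate features, only some of which carry signal.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=2,
    random_state=0,
)

# Keep the 3 features the logistic model finds most useful (3 is an arbitrary choice).
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print("Selected feature columns:", selected_columns)
```

Other approaches, such as L1-regularized logistic regression, would also count as automated feature selection and may pick a different subset.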
Now for the predictions.
Oddities and Bold Predictions:
Before retraining the logistic regression model, I noticed some glaring differences between the random forest model and the previous logistic regression model. While most predictions were the same, some obvious discrepancies along with the poor performance thus far encouraged me to retrain the model. After retraining, both the logistic regression model and the random forest model have the same picks for this week, which is convenient and a little strange. But hopefully that means better performance going forward.
- Bears vs. Colts
- Both models pick the Colts to win, but the Bears are 3-0 and playing at home, while the Colts are 2-1. Both have played mediocre teams so far. I'm kinda leaning towards the Bears personally. 538 also picks the Bears.
- Bengals vs. Jaguars
- The old model predicted a Bengals victory while the new models pick the Jaguars. The Bengals are 0-2-1 and the Jaguars are 1-2, but against weaker teams. Not really inclined either way. 538 picks the Bengals to win.
- Buccaneers vs. Chargers
- The old model had Buccaneers losing, which is clearly the wrong prediction. The new models make the "right" choice in my opinion by picking the Buccaneers. 538 agrees.
- Texans vs. Vikings
- The old model barely picked the Texans to win, but the new models barely pick the Vikings to win. Texans play at home. Both are 0-3 but both have played good teams. No personal inclination on this one. 538 picks the Texans to win.
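The comparison with 538 in the list above can be tallied in a few lines. The picks are taken directly from the four games listed; the variable names are just for illustration.

```python
# Each entry: (matchup, the new models' pick, 538's pick), as stated in the list above.
week4_picks = [
    ("Bears vs. Colts", "Colts", "Bears"),
    ("Bengals vs. Jaguars", "Jaguars", "Bengals"),
    ("Buccaneers vs. Chargers", "Buccaneers", "Buccaneers"),
    ("Texans vs. Vikings", "Vikings", "Texans"),
]

# Keep only the games where the models and 538 picked different winners.
disagreements = [
    matchup
    for matchup, model_pick, fivethirtyeight_pick in week4_picks
    if model_pick != fivethirtyeight_pick
]

for matchup in disagreements:
    print("Models and 538 disagree on:", matchup)
print(f"{len(disagreements)} of {len(week4_picks)} highlighted games are disagreements")
```

Tracking these disagreement games separately each week is a cheap way to see whether the models add anything beyond a well-known public forecast.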
Good luck in week 4!
Here are the predictions for Week 4: