Wednesday, June 27, 2018

Independence of Irrelevant Possibilities and The 2016 US Presidential Election

The multinomial logit model depends on the assumption of independence of irrelevant alternatives (IIA).  While there are different formulations, the general idea behind this principle is that when comparing two alternatives X and Y that stand in some preference relationship to each other, adding a third alternative Z shouldn't change the original relationship between X and Y.  That is, if X is preferred to Y, then adding Z should not change the fact that X is preferred to Y.  Z is supposed to be an irrelevant alternative to the consideration of X vs. Y.  See the Wikipedia article for more details (https://en.wikipedia.org/wiki/Independence_of_irrelevant_alternatives).

The IIA assumption is subject to several criticisms (e.g., the red bus/blue bus problem), which usually stem from having "alternatives" that are too similar to each other to be treated as genuinely distinct.  What most interested me in reading about this principle is its applicability (or inapplicability) to voting.  In reading about various voting strategies, the problem seems to arise when one has 3 (or more) candidates and no candidate is preferred by a majority to all other candidates.  That is, some percentage less than 50% prefers A to B, some percentage less than 50% prefers B to C, and some percentage less than 50% prefers C to A.  This means that, for any candidate elected, there could be a majority that would have preferred a different candidate to be elected (again, see the Wikipedia article for more details).

When reading about this, the 2016 US Presidential election immediately came to mind.  It seems that, generally speaking, voters were dissatisfied with the two main candidates, Trump and Clinton, and that a third candidate (Johnson), while preferred outright by very few, was seen by many as a legitimate alternative to the prospect of either Trump or Clinton being elected.  The final popular vote percentages were as follows (from http://www.presidency.ucsb.edu/showelection.php?year=2016):
  • Clinton: 48.2%
  • Trump: 46.1%
  • Johnson: 3.3%
Suppose, for the sake of argument, that:
  • 48.2% preferred Clinton to Johnson, and Johnson to Trump
  • 46.1% preferred Trump to Johnson, and Johnson to Clinton
  • 3.3% preferred Johnson to Trump, and Trump to Clinton
Then we have the interesting result that:
  • 51.5% preferred Johnson to Trump, the candidate who did in fact win.
Or suppose we change the last line and make it:
  • 3.3% preferred Johnson to Clinton, and Clinton to Trump
Then we get that:
  • 49.4% preferred Johnson to Clinton, which is higher than the popular vote for either Clinton or Trump.
If the above is close to people's actual preferences (that is, if people preferred Johnson to the other party's candidate), then one can argue that Johnson "should" have been elected as the compromise candidate that most people, or even a majority, would prefer if their first-choice candidate wasn't chosen.  Perhaps this isn't actually how people would have ranked the candidates, but it is certainly a reasonable possibility.
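To make the pairwise arithmetic concrete, here is a small R sketch (not from the original post) that tallies head-to-head shares under the first set of assumed rankings; the prefers() helper is purely illustrative:

    # Tally head-to-head preferences from the hypothetical ranking groups above
    # (shares are in percent; rankings are the first scenario's assumptions)
    groups <- list(
      list(share = 48.2, ranking = c("Clinton", "Johnson", "Trump")),
      list(share = 46.1, ranking = c("Trump", "Johnson", "Clinton")),
      list(share =  3.3, ranking = c("Johnson", "Trump", "Clinton"))
    )

    # Share of voters ranking candidate x above candidate y
    prefers <- function(x, y, groups) {
      sum(sapply(groups, function(g) {
        if (match(x, g$ranking) < match(y, g$ranking)) g$share else 0
      }))
    }

    prefers("Johnson", "Trump", groups)    # 48.2 + 3.3 = 51.5
    prefers("Johnson", "Clinton", groups)  # 46.1 + 3.3 = 49.4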

To go back to the multinomial model, what this means is that in many cases, perhaps even the most important cases where we want to make multinomial predictions (e.g., elections), the IIA assumption the model requires will almost certainly be violated.  Often, our most important decisions involve choices that are too similar to each other, with overlapping positives and negatives that are not clearly weighed against each other or decisive in leading to a choice.  Many of our conditional choices do in fact depend on the presence or absence of alternatives that we would not choose.  While there are other models that do not rely on IIA (and that come with their own challenges), this problem certainly presents a challenge to anyone doing multinomial prediction.
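The post has no code, but below is a minimal sketch, on simulated data invented for illustration, of one informal way to probe IIA with a multinomial logit fit via nnet::multinom: refit on a reduced choice set and check whether the conditional probability of choosing A over B changes (a formal version of this idea is the Hausman-McFadden test).  All names and the data-generating process here are assumptions:

    # Simulated, hypothetical three-way choice whose probabilities depend on one covariate
    library(nnet)

    set.seed(2016)
    n <- 5000
    ideology <- rnorm(n)
    util <- cbind(A = exp(0.5 * ideology), B = exp(-0.5 * ideology), C = rep(1, n))
    prob <- util / rowSums(util)
    choice <- apply(prob, 1, function(p) sample(c("A", "B", "C"), 1, prob = p))
    voters <- data.frame(ideology = ideology, choice = factor(choice))

    # Model 1: multinomial logit over all three alternatives
    fit_all <- multinom(choice ~ ideology, data = voters, trace = FALSE)

    # Model 2: binary logit fit only on the voters who chose A or B
    ab     <- subset(voters, choice != "C")
    fit_ab <- glm(I(choice == "A") ~ ideology, family = binomial, data = ab)

    # Compare P(A | chose A or B) under both models; if IIA holds,
    # dropping C should leave this conditional probability unchanged
    newdat <- data.frame(ideology = c(-1, 0, 1))
    p_all  <- predict(fit_all, newdat, type = "probs")   # matrix with columns A, B, C
    p_all[, "A"] / (p_all[, "A"] + p_all[, "B"])
    predict(fit_ab, newdat, type = "response")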

Wednesday, June 20, 2018

Cross Validation: An Experiment

The purpose of a predictive model is NOT to predict existing data well.  Predicting existing data well is the means, not the end; the end is to predict future, unseen data well.  Typically, in order to avoid creating a great-fitting model that doesn't perform well on future data (i.e., overfitting), an existing data set is divided into a training set and a test set: the model is trained on the training set and then tested against the test set to see how well it performs.  However, what if you don't have enough data to set aside a separate test set?  And how do you get a sense of the variation in your model's performance?  You may have gotten a lucky test set...
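For the mechanics, a minimal sketch of the usual split, using the built-in mtcars data purely as a stand-in (the 70/30 ratio is an arbitrary but common choice):

    set.seed(123)
    n        <- nrow(mtcars)
    train_id <- sample(n, size = round(0.7 * n))
    train    <- mtcars[train_id, ]
    test     <- mtcars[-train_id, ]

    fit  <- lm(mpg ~ wt + hp, data = train)   # fit on the training set
    pred <- predict(fit, newdata = test)      # predict on the held-out test set
    sqrt(mean((test$mpg - pred)^2))           # test RMSE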

Cross validation provides a way to test one's model while still using all of the available data to train it, and it gives some confidence that one didn't simply get lucky with the particular test data used.  In short, the model is fit on a subset of the data and then tested against the held-out subset, and this can be done many times.  Typically, with K-fold cross validation, 10 "folds" are created; a model is trained on 9 of the folds and tested against the held-out fold, and the process repeats with the next combination, for a total of 10 different models.  The results are then averaged so one can see how well one's model does "on average".

While 5 or 10 folds is the usual standard, I wanted to see what would happen across a variety of values for K.  Using the Boston data set from the MASS package, I performed the cross validation with the caret package.

To start, I created a baseline glm model to predict a target variable (whether per-capita crime is above or below the median crime rate).  This gave me a model with an accuracy of 0.9249 and a 95% CI of (0.8971, 0.9471) on the data it was trained on.  Next, I found the average cross-validated accuracy for each number of folds from 2 to 250.  The 249 results had an average accuracy of 0.9117 with a 95% CI of (0.9056, 0.9177).
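Something along these lines (not the exact code behind the numbers above, so treat the formula and threshold as assumptions) sets up the baseline model and a 10-fold cross-validation with caret:

    library(MASS)     # Boston data
    library(caret)

    boston <- Boston
    boston$high_crime <- factor(ifelse(boston$crim > median(boston$crim), "yes", "no"))
    boston$crim <- NULL    # drop the variable the target was derived from

    # Baseline: logistic regression fit and evaluated on the same data
    base_fit  <- glm(high_crime ~ ., data = boston, family = binomial)
    base_pred <- factor(ifelse(predict(base_fit, type = "response") > 0.5, "yes", "no"))
    confusionMatrix(base_pred, boston$high_crime)   # accuracy plus a 95% CI

    # Cross-validated estimate with K = 10 folds
    ctrl   <- trainControl(method = "cv", number = 10)
    cv_fit <- train(high_crime ~ ., data = boston, method = "glm", trControl = ctrl)
    cv_fit$results$Accuracy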

From the chart below one can see that accuracy is slightly lower at lower values of K (there is a vertical line at K = 10).  Beyond 50 folds the average seems fairly settled around 0.9125, with some variation between 0.900 and 0.925.



So what does this mean?  It suggests that our baseline model was a little too optimistic at an accuracy of 0.925, since that sits at the upper end of how the cross-validated models performed.  However, this result is still within the CI of our baseline model, which is good to know and reassuring.  In any case, we now have a more reliable assessment of how our model will perform on future, unseen data.  As expected, it is lower than the training accuracy, but not by so much as to make the model unusable for future predictions.

We also notice that our estimate of future-data accuracy is affected by the number of folds we choose.  Too few folds means we have not tested against enough hold-out sets to be confident that our model generalizes well, so the estimate may not be trustworthy.  On the other hand, too many folds may produce wildly different models due to differences in each small sample (and may take a lot of computing power).  While I'd need more experience to generalize, I don't see a reason not to use the standard K = 10 as a good middle ground.  In my experiment, increasing the folds above 50 barely moves the accuracy, and when viewing the results on a plot with a y-axis of 0 to 1, there is no discernible difference.
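Continuing the sketch above, a fold sweep in the spirit of the 2-to-250 experiment might look like the following (it refits the model hundreds of times, so it is slow):

    ks <- 2:250
    acc_by_k <- sapply(ks, function(k) {
      ctrl <- trainControl(method = "cv", number = k)
      train(high_crime ~ ., data = boston, method = "glm", trControl = ctrl)$results$Accuracy
    })

    plot(ks, acc_by_k, type = "l",
         xlab = "Number of folds (K)", ylab = "Cross-validated accuracy")
    abline(v = 10, lty = 2)   # the conventional K = 10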


To summarize, if you need a way to test your model's ability to generalize to future/test data and would like a sense of how well it does this, consider cross validation to create a more robust, stable, and reliable model. 

Models aside, testing one's hypotheses, beliefs, or claims about the world (i.e., the results from one's model) to see how they perform against facts and experience (i.e., future data) is just good epistemic practice.  If we want our beliefs to be justified and well grounded, shouldn't we want our models to be as well?  So just as you should "cross validate" your beliefs, make sure to validate your models as well.  Cross validation is an effective means of doing so.


References:
https://www.r-bloggers.com/cross-validation-for-predictive-analytics-using-r/

Wednesday, June 13, 2018

The Confusion Matrix, Cost-Benefit Analysis, and Decision Making

The confusion matrix is aptly named.  Even though I have studied and used it many times, I still get confused and turned around trying to remember the meaning of sensitivity vs. specificity or false negatives vs. false positives.  Despite its confusing nature, understanding what the confusion matrix means and being able to use it correctly is not only important for those in data-related fields.  If we abstract the confusion matrix and understand it as a way to assess trade-offs, or costs vs. benefits, then we can see that this way of thinking is important for everyone and can help us make better decisions in our daily lives.

The confusion matrix is typically presented as a four box table comparing predictions vs. actual values for a model.  On the rows, we have the predictions for a class (positive on top, negative on the bottom).  On the columns, we have the actuals for a class (positive on the left, negative on the right).  Consequently, when our model predicts a positive accurately, this is a true positive (top left corner).  Similarly, when our model predicts a negative correctly, this is a true negative (bottom right corner).  However, when our model predicts a positive but this is actually a negative, this is called a false positive (the model falsely predicted positive).  Similarly, when our model predicts a negative but this is actually a positive, this is called a false negative (the model falsely predicted a negative).  See here if you are still confused.
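As a toy illustration of that layout (with made-up counts):

    # Predictions on the rows (positive on top), actuals on the columns (positive on the left)
    conf <- matrix(c(40, 10,    # predicted positive: 40 true positives, 10 false positives
                      5, 45),   # predicted negative:  5 false negatives, 45 true negatives
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("positive", "negative"),
                                   actual    = c("positive", "negative")))
    conf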

Ideally, we want our predictive models to correctly predict positives when the actuals are positive and negatives when the actuals are negative, and so there would be no false positives or false negatives.  But this isn't reality.  Reality is this: (1) our models, even our best models, get it wrong sometimes (they produce false positives and false negatives); (2) we have competing models that make different predictions and get different right and wrong predictions.

This presents us with a question: in the face of multiple models, which model do we choose and why?  Enter the costs and benefits associated with each kind of correct and incorrect prediction.  But before we can even do that, we have to really understand what problem we are trying to solve.  What are we trying to do?  What is the model supposed to help us do?  What, in the end, are we trying to optimize?  Once we have a good understanding of this, we can assign costs and benefits to each kind of prediction and outcome, and then choose the model that offers the greatest benefit at the lowest cost.

For example, suppose we have medical test A and medical test B for detecting a certain disease.  Perhaps medical test A has a higher overall accuracy than medical test B.  Then it seems we ought to choose medical test A.  However, suppose that medical test A also has far more false negatives than medical test B, and a false negative leads to the death of the individual (the disease is present but is missed by the test, and because it goes untreated the person dies).  Then maybe medical test B is better.  Now suppose medical test B is prohibitively expensive while medical test A is relatively cheap.  Then maybe we are back to choosing medical test A as the preferred option, all things considered.
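To make this concrete, here is a toy calculation with invented counts and costs; the point is only that the ranking of the tests can flip once a cost is attached to each cell of the confusion matrix:

    costs <- c(TP = 0, FP = 500, FN = 100000, TN = 0)   # assumed: a missed disease is very costly

    test_A <- c(TP = 90, FP = 5,  FN = 10, TN = 895)    # hypothetical confusion-matrix counts
    test_B <- c(TP = 97, FP = 40, FN = 3,  TN = 860)

    expected_cost <- function(counts, costs) sum(counts * costs[names(counts)])

    expected_cost(test_A, costs)   # the 10 missed cases dominate the total cost
    expected_cost(test_B, costs)   # fewer missed cases, at the price of more false alarms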

The point here is that we have to be clear about what we are trying to optimize, and there will be trade-offs (e.g., cost vs. lives saved) that we will have to weigh in order to do so.  And people may disagree about how to assign costs and benefits or about what ought to be optimized.  In such a case, some sort of compromise will likely be required (e.g., saving costs while also saving lives).  Still, the very act of thinking critically about what we are trying to optimize, and what the relevant costs and benefits are, is a skill worth developing, and it is what effective decision making requires in all parts of life.  A confusion matrix forces us to think in this way and to develop this skill.

While the confusion matrix can perhaps tell us which model is statistically best, it cannot tell us which model is best from a cost-benefit analysis in the given business, medical, or other context.  We have to supply the criteria for what is a cost and what is a benefit.  But thankfully, once we are clear on that, we can figure out which model to use in order to optimize the cost vs. benefit tradeoff.  And once that model has been chosen, we can confidently use it to guide our decisions on what to do based on its predictions.  We will also have benefited from being forced to deeply understand the nature of the problem we are trying to solve.  This exercise makes us better thinkers and better decision makers, and hopefully, makes the world a little less confusing to us and others.

Wednesday, June 6, 2018

The Median and the Mean

The mean and median are both used to understand what the "center" of a distribution is, or what may be considered "normal" or "typical".  But what are their strengths and weaknesses?  Why prefer one to another, and when should you choose one versus the other?

The mean is the average value, found by adding up all of the values and dividing that sum by the number of values.  For example, the mean of 2, 3, 4 is (2 + 3 + 4)/3 = 3.  The median is the middle value, found by arranging all values in order and selecting the middle one (or the average of the two middle values if there isn't a single middle value).  For example, if we have 3, 6, 1, 8, 9, then the median is 6, since 6 is the middle value once sorted (1, 3, 6, 8, 9).

In a normal distribution, uniform distribution, or any other distribution of values that is symmetric around the middle (the same number of values on either side of the "middle", the same spread of values on either side of the "middle"), the mean and median should be roughly the same.  However, when the values are skewed in some way (more values on one side of the "middle" than the other, differing spreads on either side), the mean and median can be radically different.  A single value like an outlier can greatly change the mean while leaving the median untouched.

So why wouldn't we always use the median if it is always going to give us a better "middle" value?  Because it may not actually give us a better "middle" value.  For example, suppose you have the values 1, 2, 2, 3, 98, 99, 100.  The mean is about 43.6.  While no actual value lies anywhere near 43.6, we intuitively take this to be the center of the distribution of values.  However, if you find the median, you get 3, which does not adequately describe the middle of the distribution.  So, if the values fall into two separated groups, the median will not find the center between those two groups, but will sit in one group or the other, whereas the mean will land on a (non-occurring) value in between the two groups that more accurately represents the "middle".

Also, suppose that we have this set of values: 1, 1, 1, 1, 1, 1, 1, 50, 50, 50, 50, 50, 75, 100.  The mean is about 31 while the median is 25.5 (halfway between 1 and 50).  This is similar to the situation above in that the mean seems to do a better job of getting at the "middle" value than the median.  In this case, both the mean and the median are values that do not actually appear in the data.  We might also say that the best "typical" value here is the mode, which is 1.
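Both examples are easy to check in R:

    x1 <- c(1, 2, 2, 3, 98, 99, 100)
    mean(x1)     # ~43.6
    median(x1)   # 3

    x2 <- c(rep(1, 7), rep(50, 5), 75, 100)
    mean(x2)     # ~30.9
    median(x2)   # 25.5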

We could contrive other sets of values that make the mean, median, or mode the "best" or most intuitive single representative of the "typical" or "normal" or "middle" value in a data set.  The main point is that the examples above show the need to look at the data to understand which measure makes the most sense for a given purpose.  Neither the mean nor the median will be the "right" representation of the "middle" or "typical" value in every situation.

One other consideration: the mean is typically faster to compute than the median.  Why?  For the mean, the order of the values does not matter: you add them up and divide by their count.  For the median, the values have to be sorted before the middle value can be found, and this takes longer.  I ran a couple of experiments in R on my computer to test this.  Here are the results:

#1: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 1,000 numbers from 0 to 1,000.
  • Total Time:
    • Mean:    2.07 seconds
    • Median: 7.76 seconds
  • Mean Time:
    • Mean:    0.00002070 seconds
    • Median: 0.00007765 seconds
  • Median Time:
    • Mean:    0 seconds
    • Median: 0 seconds
  • Max Time:
    • Mean:    0.00250 seconds
    • Median: 0.17618 seconds

#2: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 10,000 numbers from 0 to 1,000.
  • Total Time:
    • Mean:    4.11 seconds
    • Median: 29.50 seconds
  • Mean Time:
    • Mean:    0.0000411 seconds
    • Median: 0.0002950 seconds
  • Median Time:
    • Mean:    0.000000 seconds
    • Median: 0.000489 seconds
  • Max Time:
    • Mean:    0.0035 seconds
    • Median: 0.1822 seconds

#3: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 100,000 numbers from 0 to 10,000.
  • Total Time:
    • Mean:      23.38 seconds
    • Median: 243.73 seconds
  • Mean Time:
    • Mean:    0.000233 seconds
    • Median: 0.002437 seconds
  • Median Time:
    • Mean:    0.000000 seconds
    • Median: 0.002001 seconds
  • Max Time:
    • Mean:     0.031 seconds
    • Median:  0.193 seconds
On average, the median calculation takes roughly 4, 7, and 10 times as long as the mean calculation in the three experiments.  I suspect that if I continued to increase the size of the task, this disparity would grow further.
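For reference, a timing comparison along these lines could be collected as follows (a sketch, not the exact script behind the numbers above):

    set.seed(1)
    n_sims <- 100000
    n_vals <- 1000
    mean_times   <- numeric(n_sims)
    median_times <- numeric(n_sims)

    for (i in seq_len(n_sims)) {
      x <- runif(n_vals, 0, 1000)
      mean_times[i]   <- system.time(mean(x))["elapsed"]
      median_times[i] <- system.time(median(x))["elapsed"]
    }

    c(total = sum(mean_times),   average = mean(mean_times),
      median = median(mean_times), max = max(mean_times))      # summary for mean()
    c(total = sum(median_times), average = mean(median_times),
      median = median(median_times), max = max(median_times))  # summary for median()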

To conclude, think critically about which value (mean, median, or even the mode) best represents the data set for your particular purposes.  If either the mean or the median would suffice, and especially if the repeated technical work you are doing is time consuming (e.g., data transformations, modeling), the mean may be the better choice for performance reasons.