Wednesday, June 6, 2018

The Median and the Mean

The mean and median are both used to understand what the "center" of a distribution is, or what may be considered "normal" or "typical".  But what are their strengths and weaknesses?  Why prefer one to another, and when should you choose one versus the other?

The mean is the average value, and is found by adding up all values and then dividing that sum by the total number of values in the calculation.  For example, the mean of 2, 3, 4 is (2 + 3 + 4)/3 = 3. The median is the middle value, found by arranging all values in order and then selecting the middle value (or using the average between the two middle values if there isn't a single middle value).  For example, if we have 3, 6, 1, 8, 9, then the median is 6 since 6 is the middle value (1, 3, 6, 8, 9).
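The two definitions above can be written out directly. Here is a minimal sketch in Python (chosen purely for illustration; the function names `mean_of` and `median_of` are made up for this example and are not from any library):

```python
def mean_of(values):
    """Add up all values and divide by how many there are."""
    return sum(values) / len(values)


def median_of(values):
    """Sort the values, then take the middle one
    (or the average of the two middle ones when the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(mean_of([2, 3, 4]))          # 3.0
print(median_of([3, 6, 1, 8, 9]))  # 6
```

In practice, Python's built-in `statistics.mean` and `statistics.median` do the same job.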

In a normal distribution, uniform distribution, or any other distribution of values that is symmetric around the middle (the values on one side of the "middle" mirror the values on the other side), the mean and median should be roughly the same.  However, when the values are skewed in some way (for example, a long tail of values on one side of the "middle" but not the other), then these can be radically different.  A single extreme value like an outlier can greatly change the mean while leaving the median untouched, or nearly so.
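To see the effect of an outlier in numbers, here is a small sketch in Python using the standard `statistics` module. The data values are invented for illustration: a symmetric list, and the same list with its largest value replaced by an extreme one.

```python
from statistics import mean, median

values = [10, 11, 12, 13, 14]          # symmetric around 12
with_outlier = [10, 11, 12, 13, 1000]  # largest value replaced by an outlier

# Symmetric data: mean and median agree (both 12).
print(mean(values), median(values))

# With the outlier: the mean jumps to 1046 / 5 = 209.2, the median is still 12.
print(mean(with_outlier), median(with_outlier))
```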

So why wouldn't we always use the median if it is less affected by outliers?  Because it may not actually give us a better "middle" value. For example, suppose you have the values 1,2,2,3,98,99,100.  The mean is about 43.6.  While there is no actual value of 43.6, we intuitively believe this to be near the center of the distribution of values.  However, if you find the median, you get 3, which does not adequately describe the middle value of the distribution.  So, if the values are in two separated groups, the median will not find the center between those two groups, but will be in one group or the other, whereas the mean will find a non-actual value that is in between the two groups and does more accurately represent the "middle" value.
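The numbers in this example are easy to check. Here is a short sketch in Python using the standard `statistics` module; it also confirms that the median is an actual data point while the mean is not.

```python
from statistics import mean, median

values = [1, 2, 2, 3, 98, 99, 100]

m = mean(values)      # 305 / 7, about 43.57
med = median(values)  # 3, the 4th of the 7 sorted values

print(round(m, 2), med)
print(m in values, med in values)  # is each summary an actual data point?
```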

Also, suppose that we have this set of values: 1,1,1,1,1,1,1,50,50,50,50,50,75,100.  The mean is about 30.9 while the median is 25.5 (halfway between 1 and 50).  This is similar to the above situation in that the mean seems to do a better job at getting at the "middle" value than the median.  In this case, both the mean and the median are values that do not actually exist in the data.  We might also say that the best value to use here as being the "typical" value is the mode, which is 1.
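For this data set, all three summaries can be computed side by side. Here is a brief sketch in Python using the standard `statistics` module:

```python
from statistics import mean, median, mode

# Seven 1s, five 50s, then 75 and 100 (14 values in total).
values = [1] * 7 + [50] * 5 + [75, 100]

print(round(mean(values), 2))  # 432 / 14, about 30.86
print(median(values))          # 25.5, halfway between the 7th value (1) and the 8th (50)
print(mode(values))            # 1, the most frequent value
```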

We could contrive other sets of values that make the mean, median, or mode the "best" or most intuitive single representative of the "typical" or "normal" or "middle" value in the data set.  The main point here is that the above shows the need to look at the data to understand which value makes the most sense for our given purpose.  Neither the mean nor the median will be the "right" value representation of the "middle" or the "typical" value for every situation.

One other consideration.  The mean is typically faster to find or calculate than the median.  Why?  For the mean, the order of the values does not matter: you add them up and divide by their count.  For the median, the straightforward approach is to sort the values first before the middle value can be found.  This takes longer.  I ran a couple of experiments on my computer in R to test this.  Here are the results:

#1: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 1,000 numbers from 0 to 1,000.
  • Total Time:
    • Mean:    2.07 seconds
    • Median: 7.76 seconds
  • Mean Time:
    • Mean:    0.00002070 seconds
    • Median: 0.00007765 seconds
  • Median Time:
    • Mean:    0 seconds
    • Median: 0 seconds
  • Max Time:
    • Mean:    0.00250 seconds
    • Median: 0.17618 seconds

#2: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 10,000 numbers from 0 to 1,000.
  • Total Time:
    • Mean:    4.11 seconds
    • Median: 29.50 seconds
  • Mean Time:
    • Mean:    0.0000411 seconds
    • Median: 0.0002950 seconds
  • Median Time:
    • Mean:    0.000000 seconds
    • Median: 0.000489 seconds
  • Max Time:
    • Mean:    0.0035 seconds
    • Median: 0.1822 seconds

#3: 100,000 simulations of finding the mean and the median for a uniformly randomly generated list of 100,000 numbers from 0 to 10,000.
  • Total Time:
    • Mean:      23.38 seconds
    • Median: 243.73 seconds
  • Mean Time:
    • Mean:    0.000233 seconds
    • Median: 0.002437 seconds
  • Median Time:
    • Mean:    0.000000 seconds
    • Median: 0.002001 seconds
  • Max Time:
    • Mean:     0.031 seconds
    • Median:  0.193 seconds

On average, the median calculation took roughly 4, 7, and 10 times as long as the mean calculation in experiments #1, #2, and #3, respectively.  I expect that if I continued to increase the size of the lists, this disparity would also increase.
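For anyone who wants to try a similar comparison, here is a rough sketch in Python. This is not the R script behind the results above. It is a simplified stand-in with several assumptions: the function name `time_mean_and_median` and its parameters are made up for this example, the same random list is reused for every repetition rather than regenerated, and only total times are reported. Timings will differ by machine and language, so the numbers will not match the R results.

```python
import random
import statistics
import timeit


def time_mean_and_median(list_size, repeats, upper=1000, seed=0):
    """Return (mean_seconds, median_seconds): total time spent computing
    each statistic `repeats` times on one uniformly random list."""
    rng = random.Random(seed)
    values = [rng.uniform(0, upper) for _ in range(list_size)]

    # The mean needs one pass over the values: add them up, divide by the count.
    mean_seconds = timeit.timeit(lambda: sum(values) / len(values), number=repeats)

    # statistics.median sorts a copy of the data before picking the middle.
    median_seconds = timeit.timeit(lambda: statistics.median(values), number=repeats)

    return mean_seconds, median_seconds


if __name__ == "__main__":
    for size in (1_000, 10_000):
        mean_s, median_s = time_mean_and_median(size, repeats=200)
        print(f"n={size}: mean {mean_s:.4f}s, median {median_s:.4f}s")
```

The list sizes and repeat count are kept small here so the sketch finishes quickly; increase them to get closer to the scale of the experiments above.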

To conclude, think critically about which value (mean, median, or even the mode) best represents the data set for your particular purposes.  If either the mean or the median will suffice, and especially if the repeated technical work you are doing is time-consuming (e.g., data transformations, modeling), perhaps the mean will work better for performance purposes.
