Tuesday, December 6, 2016

Police Killings and Race

Below is the content of my final project for my Master of Data Analytics program class on probability and statistics using R.  The original file is located here.

---------------------------------------------------------------------------------------------------

Part 1 - Introduction:

The issue of race and police shootings has been at the forefront of United States politics in a more pronounced way in the past couple of years. There is a belief that police shootings may be racially motivated. That is, police may be biased against minorities and choose to use lethal force more frequently when minorities are involved.

If this is true, then it means that the police force is not enforcing the law equally without reference to race. Minorities are not being guaranteed the equal protection of the laws as they have a right to in the Constitution. This is bad not only for minorities, but for all US citizens that desire to live in a country governed by laws and not by the arbitrary whims or biases of those in power.

Does the data support the belief that police killings are biased against minorities? What conclusions are warranted? I will use data from the Guardian, which is attempting to create a database of all killings by police in the US, to begin to answer this question.

Our question: is there any evidence that, since the beginning of 2015, police have killed non-whites at a higher rate than whites, or in a disproportionate amount as compared with whites? Is the rate of police killings of non-whites statistically significantly higher than the rate of killings of whites?

 

Part 2 - Data:

The data for this study comes from two sources. The first is from the Guardian database on police killings. The killings data is already collected and can be downloaded from the Guardian’s website. This information is collected by the Guardian through verified crowdsourcing and its own reporting.
The cases in this data are individuals who were killed by police officers. The name, date, race, gender, location, classification, and armed status of each individual are included in the data. From the beginning of 2015 through 12/5/2016, there were 2135 killings by police officers recorded in this data set. I will focus on the counts of killings by race, where race is a categorical variable and the count is a numerical variable.

The second source is from the US Census data where we acquire information about race and gender population percentages for recent years. I will use the data from here.

The cases in this dataset are the population estimates by month, year, age, race, and gender for July 2016 - December 2016. I will use the population estimates for December 2016 by race and gender in order to calculate per capita statistics, age distributions, and race distributions. Race (a categorical variable) will be used to match to the police killings data to compare the counts of police killings (numerical variable) to the population estimates (numerical variables) per race. I assume that the relevant population percentages by race have not changed significantly from the beginning of 2015 through 2016 in a way that would make comparing killings by race to population percentages by race invalid.

This is an observational study. I will be making inferences using already given data. I will not be setting up an experiment, nor did the data come from experiments. The data comes from observing population estimates related to race/gender and also counts of police killings by race/gender. Consequently, we will not be able to establish causality (i.e., minority race status causes a higher likelihood of police killings), but only correlation (i.e., minority race status is associated or correlated with a higher likelihood of police killings).

As the demographics data covers the entire United States and the police killings data also covers the United States, we can generalize our conclusions to the entire population of the United States by explaining counts of killings or killings per capita by race. That is, race will be the explanatory variable and number of killings will be the response variable.

One potential source of bias before moving forward is that the police killings data is provided on a voluntary basis. This means that if any police killings go unreported, especially if police killings for a certain race are more likely to go unreported, then our conclusions may be incorrect. For example, suppose that police killings for whites and blacks are proportionally similar, but only police killings for blacks are reported. Then our analysis will show that blacks are more likely to be killed than whites when in fact both races are equally likely.

As we have no way of knowing if race impacts the likelihood of a killing being reported, we have to assume that all killings are equally likely to be reported in our data, without difference due to race, gender, or any other factor. That is, we will assume that the data on police killings reflects an accurate and unbiased picture of all actual police killings in the United States.

 

Part 3 - Exploratory data analysis:

Police Killings

Let’s explore the data. Since 2015, 1070 whites have been killed by police, followed by 541 blacks and 354 Hispanic/Latinos. Whites constitute 50.1% of all killings, blacks 25.3%, and Hispanic/Latinos 16.6%. So we can see that, by absolute counts, whites are killed much more often than any other race.

##            RaceEthnicity Counts        Perc
## 1          Arab-American      6  0.28103044
## 2 Asian/Pacific Islander     41  1.92037471
## 3                  Black    541 25.33957845
## 4        Hispanic/Latino    354 16.58079625
## 5        Native American     31  1.45199063
## 6                  Other      1  0.04683841
## 7                Unknown     91  4.26229508
## 8                  White   1070 50.11709602
 
What about gender? From the table below, we can see that of those who were killed by police, 2030 were males, 103 were females, and 2 others were unknown or “non-conforming”. Clearly, one is much more likely to be killed if one is a male as opposed to being a female.

##           Gender Counts
## 1         Female    103
## 2           Male   2030
## 3 Non-conforming      1
## 4        Unknown      1
 
Age is also an interesting variable to look at. The average age of the person killed is about 37 years old with a standard deviation of 13.1. The age of those killed is skewed right. Very few under 20 are killed, while the mid 20s through the late 30s are all high, and then the distribution steadily tapers off towards the upper ages.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6.00   27.00   35.00   36.91   46.00   87.00      19
## [1] "sd: 13.1099942645768"
 
Which states have the most police killings? CA (358), TX (195), and FL (134) have the most. This does not mean they are the most dangerous per se, because each of these states has a very large population. We would need to look at police killings per capita by state in order to determine in which states one is most likely to be killed by a police officer.

##   State Count
## 1    CA   358
## 2    TX   195
## 3    FL   134
## 4    AZ    91
## 5    GA    66
## 6    OK    65
 
Let’s look at the above state killings counts, but use per capita instead. Using projected values for each state’s population in 2016 based on data from the US Census, we can see that NM has the highest ratio of police killings to population, followed by OK and AK.

##   State Count Year Population RatioCountsPopulation
## 1    NM    43 2016  2093483.6          2.053993e-05
## 2    OK    65 2016  3942047.8          1.648889e-05
## 3    AK    12 2016   747272.9          1.605839e-05
## 4    DC    11 2016   687204.0          1.600689e-05
## 5    WY     8 2016   593512.4          1.347908e-05
## 6    AZ    91 2016  6898672.3          1.319094e-05
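
The per capita ratios in the table above are simply each state's count divided by its projected population. Here is a minimal sketch of that calculation in R, using only the six rows shown; the data frame name `state_killings` and the extra `PerMillion` column are my own additions for illustration and are not part of the original analysis.

```r
# Counts and projected 2016 populations copied from the per capita table above.
state_killings <- data.frame(
  State      = c("NM", "OK", "AK", "DC", "WY", "AZ"),
  Count      = c(43, 65, 12, 11, 8, 91),
  Population = c(2093483.6, 3942047.8, 747272.9, 687204.0, 593512.4, 6898672.3)
)

# Killings per resident, plus a per-million version that is easier to read.
state_killings$RatioCountsPopulation <- state_killings$Count / state_killings$Population
state_killings$PerMillion <- state_killings$RatioCountsPopulation * 1e6

# Sort from the highest rate to the lowest.
state_killings <- state_killings[order(-state_killings$RatioCountsPopulation), ]
print(state_killings, row.names = FALSE)
```

Expressed per million residents, the NM figure of 2.05e-05 becomes roughly 20.5, which is easier to compare with the per-million rates by race used later on.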
 

The classification of police killings varies as well. The vast majority are killed by gunshot (1940), but there are also deaths in custody (77), by taser (70), by being struck by a vehicle (46), and other (2).

##      Classification Counts
## 1           Gunshot   1940
## 2  Death in custody     77
## 3             Taser     70
## 4 Struck by vehicle     46
## 5             Other      2
 
Finally, we have information about whether the person killed by a police officer was armed or not, and what kind of weapon was available to that person. Most individuals have a firearm when killed (1014), but the next largest category is that the individual was not armed at all (372). That is, 17% of the time, a police killing occurs and the individual is not armed.

Assuming that such an individual is not posing an immediate threat to the lives of the police officers or others, and police officers should only take lethal action when this is the case, this is surprisingly (and disturbingly) high. I would, perhaps naively, assume that such instances would be much lower, and only reflect situations in which the police officer thought the person had a weapon, but he or she in fact did not. Or perhaps such instances occur when the person is attempting to take a police officer’s weapon, or is hitting or charging a police officer, and so is in fact threatening the life of the police officer or another person.

Without more research into the exact circumstances of these non-armed killings, it is impossible to say whether such killings were justified or not. Further research does seem to be necessary, given that the number is so high. These are clearly not isolated or infrequent occurrences, and as such, they cannot be ignored.

##                Armed Count
## 1            Firearm  1014
## 2                 No   372
## 3              Knife   295
## 4            Unknown   150
## 5              Other   127
## 6 Non-lethal firearm    89
## 7            Vehicle    75
## 8           Disputed    13
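
The 17% figure for unarmed individuals can be recomputed directly from the counts in the table above. This is a small sketch; the vector name `armed` is just a label I chose for the example.

```r
# Counts copied from the Armed table above.
armed <- c(Firearm = 1014, No = 372, Knife = 295, Unknown = 150, Other = 127,
           `Non-lethal firearm` = 89, Vehicle = 75, Disputed = 13)

# Share of all killings in each category, as a percentage.
armed_perc <- round(100 * armed / sum(armed), 1)
print(armed_perc)
```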
 

Do there appear to be differences by race in whether the person killed was armed or not? By percentage of all killings within a given race, there are noticeable differences. In particular, blacks have the highest percentage of being killed when not armed (21.6%).

Demographics

Let’s turn our attention to population demographics by race. For December 2016, we have the following estimates:
  • Males are 49.2% of the estimated population. Females are 50.8%.
  • Whites alone are 76.7% of the population
  • Blacks alone are 13.3% of the population
  • American Indian alone are 1.3% of the population
  • Asian alone are 5.8% of the population
  • Hawaiian/Islander alone are 0.2% of the population
  • Hispanic are 17.9% of the population
The full results are below.

##           Type        value
## 1      TOT_POP 3.250328e+08
## 2     TOT_MALE 1.600346e+08
## 3   TOT_FEMALE 1.649982e+08
## 4      WA_MALE 1.236208e+08
## 5    WA_FEMALE 1.258108e+08
## 6      BA_MALE 2.073687e+07
## 7    BA_FEMALE 2.255415e+07
## 8      IA_MALE 2.060932e+06
## 9    IA_FEMALE 2.024201e+06
## 10     AA_MALE 8.955957e+06
## 11   AA_FEMALE 9.859911e+06
## 12     NA_MALE 3.984740e+05
## 13   NA_FEMALE 3.860640e+05
## 14      H_MALE 2.944718e+07
## 15    H_FEMALE 2.885643e+07
## 16      WA_ALL 2.494316e+08
## 17      BA_ALL 4.329102e+07
## 18      IA_ALL 4.085133e+06
## 19      AA_ALL 1.881587e+07
## 20      NA_ALL 7.845380e+05
## 21       H_ALL 5.830360e+07
## 22     WA_PERC 7.674043e-01
## 23     BA_PERC 1.331897e-01
## 24     IA_PERC 1.256837e-02
## 25     AA_PERC 5.788914e-02
## 26     NA_PERC 2.413720e-03
## 27      H_PERC 1.793776e-01
## 28   MALE_PERC 4.923645e-01
## 29 FEMALE_PERC 5.076355e-01

Bringing them together

We have explored both data sets separately. Now let’s combine them and explore the combination. Since 2015, the chances of a United States resident getting killed by a police officer are 6.568 per million residents. For whites, this is 4.29. For blacks, this is 12.49. For Hispanics/Latinos, this is 6.07.

So even though many more whites are killed than blacks or any other race, proportionally, whites are killed below the average per million residents while blacks are killed at almost twice the average rate per million residents.

When we look at how the percentage of killings by race compares to the percentage of total population by race, we see that Asian/Pacific Islanders constitute 6% of the population, but only 2% of killings. Blacks constitute 13% of the population, but 25% of all killings. Hispanic/Latinos constitute 17% of the killings and 18% of the population. Whites constitute 50% of the killings and 77% of the population.

So it appears that Asians are under-represented, blacks are over-represented, and whites are under-represented in the police killings as compared with their representation in the total population. Hispanic/Latinos appear to be represented about as we would expect.

RaceEthnicity           Counts        Perc  Population  AVG_Kill_Pop_Million  PercTotalPopulation
Arab-American                6   0.2810304          NA                    NA                   NA
Asian/Pacific Islander      41   1.9203747    19600406              2.091793             6.030286
Black                      541  25.3395785    43291020             12.496818            13.318971
Hispanic/Latino            354  16.5807963    58303604              6.071666            17.937762
Native American             31   1.4519906     4085133              7.588492             1.256837
Other                        1   0.0468384          NA                    NA                   NA
Unknown                     91   4.2622951          NA                    NA                   NA
White                     1070  50.1170960   249431551              4.289754            76.740433
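
The derived columns in the table above follow from the counts and the census populations. The sketch below shows one way to compute them in R for the five races that have a population figure; the data frame name `combined` is an assumption of mine, and `total_pop` is TOT_POP from the demographics output written out from its rounded scientific notation, so the last digits can differ slightly from the original.

```r
total_pop      <- 325032800  # TOT_POP (3.250328e+08), rounded as printed
total_killings <- 2135       # all killings in the Guardian data

combined <- data.frame(
  RaceEthnicity = c("Asian/Pacific Islander", "Black", "Hispanic/Latino",
                    "Native American", "White"),
  Counts        = c(41, 541, 354, 31, 1070),
  Population    = c(19600406, 43291020, 58303604, 4085133, 249431551)
)

# Share of all killings, killings per million residents of that race,
# and that race's share of the total population.
combined$Perc                 <- 100 * combined$Counts / total_killings
combined$AVG_Kill_Pop_Million <- combined$Counts / (combined$Population / 1e6)
combined$PercTotalPopulation  <- 100 * combined$Population / total_pop

# Overall rate per million residents, for comparison.
overall_per_million <- total_killings / (total_pop / 1e6)

print(combined, row.names = FALSE)
print(overall_per_million)
```

Note that these rates cover the whole period from the beginning of 2015 through 12/5/2016, not a single year.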

Are these differences significant?

 

Part 4 - Inference:

Are blacks killed a disproportionate amount by police officers compared to other races? Are Asians or whites killed less often than expected by police officers? We can answer these questions statistically using Chi-Square testing for goodness of fit. The Chi-Square test will tell us if the actual distribution of killings by race is significantly different from what we would expect the distribution to be.

We have the following hypotheses:
  • H0: Police killings are distributed across races in proportion to each race’s share of the population. That is, race makes no difference to the number of police killings, and any deviations from the expected counts can be attributed to random variation.
  • HA: Police killings are not distributed across races in proportion to population share. That is, race does make a difference to the number of police killings.
We have five race categories, so we need to find the chi-square value with 4 degrees of freedom.
We have two conditions for a chi-square test that need to be satisfied:
  1. Independence: we have no reason to believe that any case of a police killing is not independent of another case of police killing. And given that we have a very large dataset, even if some cases are not completely independent (e.g., 2 killings in single event), such instances should not impact the overall condition of independence to any significant degree.
  2. Sample size / distribution: each cell in our table has an expected count of at least 5 (the smallest, for Native Americans, is about 27), so the counts do satisfy the sample size and distribution requirements.
Both conditions are satisfied, so we can perform the chi-square test on the data.

By hand, we calculate a chi-square value of 491.4602. In the table at the back of the book, the largest value listed for an X2 statistic under 4 degrees of freedom is 18.47, which corresponds to a p-value of 0.001. Our X2 statistic is much larger, so its p-value is well below 0.001 and it is clearly significant. Using chisq.test in R, we get an X2 statistic of 505.446 with a p-value of 4.447916e-108. This is also clearly significant.
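
The original code is not reproduced in this post, so the following is a sketch of one setup that reproduces both reported statistics rather than a record of what was actually run. It assumes the by-hand expected counts are each race's population share times all 2135 killings, and that `chisq.test` was called with `rescale.p = TRUE`. Rescaling is needed because the five population shares add up to more than 100%, since Hispanic ethnicity overlaps with the race categories.

```r
# Observed killings and population shares copied from the combined table.
observed <- c(`Asian/Pacific Islander` = 41, Black = 541,
              `Hispanic/Latino` = 354, `Native American` = 31, White = 1070)
pop_share <- c(6.030286, 13.318971, 17.937762, 1.256837, 76.740433) / 100

# By-hand version (assumption: expected = population share * all 2135 killings).
expected   <- 2135 * pop_share
x2_by_hand <- sum((observed - expected)^2 / expected)

# Critical value from the textbook table: upper 0.001 tail with 4 df.
critical_value <- qchisq(0.999, df = 4)

# chisq.test version (assumption: shares rescaled so they sum to 1).
test <- chisq.test(observed, p = pop_share, rescale.p = TRUE)

print(x2_by_hand)
print(critical_value)
print(test)
```

Under these assumptions the two statistics differ because the by-hand expected counts sum to more than the 2037 killings observed in these five categories, while `chisq.test` scales the expected counts to match that total.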

We can also simulate the results. Using the expected value and SE for each count of killings by race, we can use the normal distribution to get dummy “actual” values to compare with the expected values. We can calculate X2 using these values, and then see the range of X2 values that we would see due to normal random variation. We calculate 100000 dummy X2 values. When we do so, we see that our actual X2 value from the data is so extreme, that the likelihood of getting this result due to random chance is virtually 0. It is larger than any other simulated X2 value.
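
A minimal sketch of this simulation is below. It assumes the standard error of each expected count is its square root (which matches the SE values reported in this post) and that the draws for each race are independent; the seed and variable names are arbitrary choices for the example.

```r
set.seed(2016)  # arbitrary seed, only so the example is reproducible

pop_share   <- c(6.030286, 13.318971, 17.937762, 1.256837, 76.740433) / 100
expected    <- 2135 * pop_share
se          <- sqrt(expected)   # assumed SE of each expected count
observed_x2 <- 491.4602         # by-hand statistic reported above
n_sim       <- 100000

# One row per simulated data set, one column per race category.
sim_counts <- sapply(seq_along(expected),
                     function(i) rnorm(n_sim, mean = expected[i], sd = se[i]))

# X2 for every simulated data set: sum over races of (sim - expected)^2 / expected.
deviations <- sweep(sim_counts, 2, expected, "-")
sim_x2     <- rowSums(sweep(deviations^2, 2, expected, "/"))

# Share of simulated X2 values at least as large as the one from the data.
sim_pvalue <- mean(sim_x2 >= observed_x2)

print(summary(sim_x2))
print(sim_pvalue)
```

Because the five counts are drawn independently here, the simulated X2 values behave like a chi-square variable with 5 degrees of freedom rather than 4, but the observed value is so far out in the tail that the distinction does not change the conclusion.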


So we know that this result is significant. That is, race does make a difference in the number of police killings by race. Which races deviate from what we would expect? By calculating the Z-score for each actual count and a 95% confidence interval around our expected number of killings by race, we can see which actual numbers of killings by race differ significantly from expectation. We also calculate the p-value (two-tailed) and show whether it is significant or not at the 0.05 significance level.
Based on these calculations, we can see that:
  • Asian/Pacific Islanders are killed much less than we would expect. A 95% CI would have us expect between 107 and 151 killings, but only 41 are in our data set.
  • Blacks are killed way more than we would expect. This group has the highest Z-score (15.21). A 95% CI would have us expect between 251 and 317 killings, but we have 541 instead.
  • Hispanic/Latino and Native Americans fall within our expectations and are not significant at the 95% confidence level.
  • Whites are killed much less than we would expect. This group has the second-largest Z-score in absolute value (-14.04). A 95% CI would have us expect between 1559 and 1718 killings, but the actual value is 1070.
In short, Asians and whites are killed much less than we would expect, and blacks are killed much more than we would expect.

  RaceEthnicity           Counts       Perc  Population  AVG_Kill_Pop_Million  PercTotalPopulation
2 Asian/Pacific Islander      41   1.920375    19600406              2.091793             6.030286
3 Black                      541  25.339578    43291020             12.496818            13.318971
4 Hispanic/Latino            354  16.580796    58303604              6.071666            17.937762
5 Native American             31   1.451991     4085133              7.588492             1.256837
8 White                     1070  50.117096   249431551              4.289754            76.740433

  RaceEthnicity           ExpectedKillings         SE            Z     pvalue  pvalueSig95     lower95     upper95
2 Asian/Pacific Islander         128.74661  11.346656   -7.7332571  0.0000000         TRUE   106.50716   150.98606
3 Black                          284.36003  16.862978   15.2191364  0.0000000         TRUE   251.30860   317.41147
4 Hispanic/Latino                382.97122  19.569651   -1.4804159  0.1387623        FALSE   344.61471   421.32774
5 Native American                 26.83348   5.180104    0.8043321  0.4212052        FALSE    16.68047    36.98648
8 White                         1638.40825  40.477256  -14.0426578  0.0000000         TRUE  1559.07283  1717.74367
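
The columns on the right of this table can be rebuilt from the counts and population shares. The sketch below assumes, based on the reported values, that the standard error is the square root of the expected count and that the interval is the expected count plus or minus 1.96 standard errors; the variable names are mine.

```r
observed <- c(`Asian/Pacific Islander` = 41, Black = 541,
              `Hispanic/Latino` = 354, `Native American` = 31, White = 1070)
pop_share <- c(6.030286, 13.318971, 17.937762, 1.256837, 76.740433) / 100

expected <- 2135 * pop_share          # ExpectedKillings
se       <- sqrt(expected)            # SE (assumed: square root of the expected count)
z        <- (observed - expected) / se
pvalue   <- 2 * pnorm(-abs(z))        # two-tailed p-value from the normal distribution
lower95  <- expected - 1.96 * se
upper95  <- expected + 1.96 * se

results <- data.frame(ExpectedKillings = expected, SE = se, Z = z,
                      pvalue = pvalue, pvalueSig95 = pvalue < 0.05,
                      lower95 = lower95, upper95 = upper95)
print(results)
```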

Suppose we simulate the actual result using the expected value and the standard error. Using a normal distribution, we can create a distribution of the range of values we would expect to see due to chance. We use 100000 instances. When we do so, we still see clearly that Asians and whites are much lower in actual counts (red vertical line) than we would expect, while blacks are much higher. Hispanic/Latino and Native Americans capture the actual value in the range of simulated actual values.

The summary of the distribution and the p-value are given for each grouping. You can see that the simulated p-values match really well to the theoretical p-values. The 95% confidence interval for each simulation is also given.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80.18  121.20  128.70  128.80  136.40  178.50
## [1] "pvalue: 0"
## [1] "95% CI: (106.544795272427, 151.026892619616)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   212.0   273.1   284.4   284.5   295.8   356.2
## [1] "pvalue: 0"
## [1] "95% CI: (251.420923833413, 317.486970243118)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   295.5   369.6   382.8   382.9   396.1   475.0
## [1] "pvalue: 0.1376"
## [1] "95% CI: (344.518769844633, 421.247691073343)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.40   23.33   26.83   26.83   30.33   51.84
## [1] "pvalue: 0.42082"
## [1] "95% CI: (16.7008318075859, 36.9674236281517)"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1468    1611    1638    1638    1665    1825
## [1] "pvalue: 0"
## [1] "95% CI: (1558.73907069447, 1717.46375303207)"
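
Output like the above could be produced by a small helper such as the one sketched here. The function name `simulate_group`, the seed, and the use of empirical quantiles for the 95% interval are assumptions on my part; the original code may have built the interval differently.

```r
set.seed(2016)  # arbitrary seed, only so the example is reproducible

# Hypothetical helper: simulate counts for one race around the expected value.
simulate_group <- function(observed, expected, n_sim = 100000) {
  sims <- rnorm(n_sim, mean = expected, sd = sqrt(expected))
  # Two-tailed p-value: share of draws at least as far from the
  # expected value as the observed count is.
  pvalue <- mean(abs(sims - expected) >= abs(observed - expected))
  ci <- quantile(sims, c(0.025, 0.975))  # empirical 95% interval (assumption)
  list(summary = summary(sims), pvalue = pvalue, ci = ci)
}

# Observed and expected counts taken from the table in Part 4.
hispanic <- simulate_group(354, 382.97122)
black    <- simulate_group(541, 284.36003)

print(hispanic$summary)
print(paste("pvalue:", hispanic$pvalue))
print(paste0("95% CI: (", hispanic$ci[[1]], ", ", hispanic$ci[[2]], ")"))
```

Exact values will vary from run to run and with the seed, but they should land close to the theoretical p-values and intervals.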

 

Part 5 - Conclusion:

What can we conclude? While more whites are killed by police officers absolutely, blacks are killed way more proportionally. Asians are killed about a third as much as we would expect, blacks about twice as much as we would expect, and whites about two-thirds as much as we would expect. These are statistically significant results, and so we can conclude that race does make a difference in the number of killings by police.

Why is this the case? This result does not confirm police bias in favor of whites or Asians and against blacks. For example, just using the data that we have, we notice that age appears to be an important factor, as does gender. Nor does it look at the types of situations or crimes involved by race, which may help to explain why blacks are killed in a disproportionate manner. For example, if blacks were more proportionally involved in violent crime situations than other races, then that factor could explain why blacks are killed more. That is, if the above was true, then it would not be the matter of race that determined whether an individual was killed, but whether they were involved in a violent crime situation or not. However, that would then lead us to look into why blacks were involved in violent crime situations in a disproportionate manner.

We did notice that blacks are more likely to be killed while unarmed than other races are. Whether that is a significant difference would require further analysis. But it does seem to suggest that the above explanation regarding violent crime situations may not be a reasonable explanation. If blacks are in the same sorts of situations with the police that other races find themselves in with the same proportionality, but are being killed more often than other races are, then we would conclude that race is the determining factor in why they are being killed. Certainly whether the person is armed or not constitutes part of the police situation, and as we can see, blacks are killed more often proportionally speaking while unarmed than other races are. This would be a fruitful area of research to explore in the future.

The data does not provide a full enough picture to conclude if there is police bias against blacks when it comes to killings. We would need to be able to look more into the context of the police situation as well as compare with a representative number of similar incidents that did not result in a person being killed. An ideal data set, although impossible to obtain, would have a list of all/most/many police encounters (as long as these are representative). More information would be provided about the specific nature of the crime or situation involved (e.g., traffic stop, robbery), and whether the person was killed or not would also be included. From that data, we could do a prediction on whether a person would be killed or not and see if race was a significant factor in predicting that outcome.

In short, without more data, we can’t get at the real question: are police biased against blacks in police killings? We know that significantly more blacks are killed than whites or Asians, but we can’t yet say why due to the lack of other possible explanatory variables in the data set and a comparison to non-lethal police encounters. Future research would attempt to get some of this missing information in order to determine why more blacks are being killed proportionally by police officers.
