Wednesday, June 14, 2017

Data Science and Ethics: Deontology vs. Consequentialism

Introduction

Consider the following real scenario:
  • President Donald Trump has implemented a travel and refugee ban on anyone from one of several predominantly Muslim countries.  Suppose that, as a result, there are 0 terrorist incidents in the United States in the next four years, and assume that, were it not for this ban, there would have been several terrorist incidents.  
    • Is the travel and refugee ban moral?
  • Furthermore, suppose that the government has gathered various kinds of information about you based on your web surfing history, Facebook profile, shopping history, mobile device locations, etc. in order to fight terrorism. 
    • Is the collection of such data immoral? 
  • Furthermore, suppose that a machine learning model used to fight terrorism relies on the religion (or race, gender, marital status, sexual orientation, etc.) of an individual as a strong predictor of whether a person will commit an act of terrorism.  Without this field, the model cannot identify terrorists as accurately as it previously could.
    • Is the use of this highly sensitive, personal, and potentially discriminatory field permissible?
Underlying this real scenario is a sort of dilemma, pitting two desirables against each other.  On the one hand, we want good results in the aims we pursue, so the more effective a means is to achieving those results, the better.  In this case, we want protection from terrorism and safety within the United States, and so we would embrace actions that further these aims.  However, we also believe there are limits that no means can cross, that the "end does not justify the means" in every circumstance.  If you are like me, then you are probably at least a little uncomfortable with using country of origin or religion in profiling immigrants, even in the service of fighting terrorism.

This is an age-old debate between the opposing normative ethical theories of consequentialism and deontology, but it is now being played out in the realm of data science. Below, we will explore the relationship between ethics and data science and discuss how ethical theory can help provide guidance about the above scenario and similar situations.

Philosophical Terms/Distinctions

Here are some key terms and distinctions to understand:

  • Deontology
    • The normative ethical view that an action's moral rightness or wrongness is not determined solely by the outcome of the action.  Other factors are important as well (e.g., intention of the agent, the nature of the act).
  • Consequentialism
    • The normative ethical view that an action's moral rightness or wrongness is determined solely by the outcome or consequences of the action.  Typically, an action is morally right if and only if it maximizes happiness, utility, or some other desired output.
  • Morality vs. Legality
    • Actions may be legal but immoral.  Similarly, actions may be illegal but moral.  Thus, legality is not the same as morality.
    • For example, one might argue that the use of pornography is immoral, even though it is legal.  Similarly, in many countries, certain kinds of religious or political expression are illegal, but most of us wouldn't say that these kinds of expression are immoral.
  • User/Individual interests vs. Company/Government/Society/Group Interests
    • An individual/user's interests are distinct from the interests of a company, government, group, or even society.  While these interests might be aligned, they do not need to be aligned and may come into conflict.
  • Hierarchy of rights: life, liberty, property
    • There is a hierarchy to rights.  In particular, the right to life is more important than a right to liberty, which itself is more important than a right to property.  Life is necessary for liberty (without life, you cannot have liberty), and liberty is necessary for property (without liberty, you cannot exercise your right to have or make decisions with respect to your property). 
    • In case of direct conflict, the higher right must take precedence over the lower right.

 

Data Science Core Concept: Machine Learning

At the core of data science is the use of algorithms to predict and classify entities (e.g., people) into different categories.  This is machine learning.  Most often, the algorithm uses various fields (e.g., age) to predict an unknown but useful value (e.g., spending behavior).  The more informative the fields fed into the algorithm, the more accurate and useful the predictions will be.  Thus, it is very important to figure out which fields are the most useful in predicting or classifying, and which are not helpful.
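To make this concrete, here is a minimal sketch in Python, assuming scikit-learn is available.  The fields (age, income) and the spending labels are invented purely for illustration, not drawn from any real data set.

```python
# A minimal supervised-learning sketch: informative fields in, prediction out.
# The fields (age, income) and the spending labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each row describes a person by two fields: [age, annual income].
X = np.array([[22, 28000], [35, 52000], [47, 91000], [29, 40000],
              [53, 120000], [41, 75000], [25, 31000], [60, 99000]])
# Target: 1 = high spender, 0 = low spender (the unknown value we want to predict).
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# Scale the fields, then learn the field-to-target relationship.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict the unknown value for a previously unseen person.
print(model.predict(np.array([[38, 67000]])))
```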

In some kinds of "deep learning" associated with AI, the algorithms do not require the use of fields.  The algorithm constructs these on its own before making a prediction or classification.  While such algorithms often work very well, understanding why the model produced the results it did can often be nearly impossible to interpret.  That is, we don't know why the "black box" works, but it does.

Ethical Data Science: An Exploration 

So how can we apply the above philosophical terms and distinctions to the realm of data science/machine learning, and in particular, thoughtfully address the above scenario? Let's explore.

A Consequentialist Start

The reasoning behind the travel ban has the form of something like:
  • If one is Muslim (or from a predominantly Muslim country), then one is (or is very likely to be) a terrorist.
In the context of data science, the input field would be "Country of Origin" and/or "Religion", and the target field is "Terrorist/Not Terrorist".  Now suppose you have a sample data set, representative of the world population, with these fields.  What would you expect to find?

If this is all that you have, not much.  We know that most people aren't terrorists.  Most Muslims aren't terrorists.  Most Muslims from predominantly Muslim countries are not terrorists.  A model based on these fields alone would probably label everyone from these predominantly Muslim countries as not being a terrorist in order to achieve high accuracy, even though it would fail to identify any actual terrorists.  In terms of a confusion matrix, our initial model would have 0 false positives, 0 true positives, several false negatives (labeled "not a terrorist" but actually a terrorist), and many true negatives (labeled "not a terrorist" and in fact not a terrorist).

In contrast, to avoid missing any actual terrorists, the model behind President Trump's travel ban goes to the opposite extreme and labels everyone from these predominantly Muslim countries as a terrorist.  That is, this model would have many false positives (labeled "a terrorist" but not actually a terrorist), several true positives (labeled "a terrorist" and actually a terrorist), 0 false negatives, and 0 true negatives.
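Both degenerate models are easy to reproduce on a toy, heavily imbalanced sample.  The sketch below uses entirely invented numbers and scikit-learn's confusion_matrix to show how the "label everyone a non-terrorist" model earns near-perfect accuracy while catching no one, and the "label everyone a terrorist" model catches everyone at the cost of enormous false positives:

```python
# A toy illustration of the two degenerate models described above, on a
# heavily imbalanced synthetic sample. All numbers are invented for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = np.zeros(100_000, dtype=int)  # 0 = not a terrorist
y_true[rng.choice(100_000, size=10, replace=False)] = 1  # 1 = terrorist (rare)

# Model A: label everyone "not a terrorist" -> high accuracy, 0 true positives.
all_negative = np.zeros_like(y_true)
# Model B: label everyone "a terrorist" -> 0 false negatives, huge false positives.
all_positive = np.ones_like(y_true)

for name, y_pred in [("all negative", all_negative), ("all positive", all_positive)]:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / len(y_true)
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp} accuracy={acc:.4f}")
```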
 
So how would one improve this model?  With more data!  Religion and country of origin do not carry enough information on their own to provide a predictive model that correctly identifies terrorists as terrorists and correctly identifies non-terrorists as non-terrorists.  While I am no expert in what correctly identifies a terrorist as such, I would think that, in addition to Religion and Country of Origin, fields like Gender, Race, Ethnicity, Age, Education, Economic Status, Internet Activity (websites visited), Purchase History, and Travel History may be useful in more accurately identifying terrorists and non-terrorists.  Suppose that we get this information and now have a very accurate model that correctly identifies both terrorists and non-terrorists.  Great!
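The effect of adding informative fields can be sketched with synthetic data; the anonymous generated columns below stand in for whatever real fields might matter, and no claim is made about actual counter-terrorism data.  With a rare positive class, a model restricted to two fields typically misses most positives, while a model given all ten informative fields does much better:

```python
# Sketch: richer fields -> better identification of a rare positive class.
# The data are synthetic; the fields are anonymous generated columns.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# ~1% positive class, 10 informative fields available in total.
X, y = make_classification(n_samples=20_000, n_features=10, n_informative=10,
                           n_redundant=0, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for n_fields in (2, 10):  # a weak 2-field model vs. a richer 10-field model
    clf = RandomForestClassifier(random_state=0).fit(X_tr[:, :n_fields], y_tr)
    pred = clf.predict(X_te[:, :n_fields])
    print(f"{n_fields} fields: "
          f"precision={precision_score(y_te, pred, zero_division=0):.2f}, "
          f"recall={recall_score(y_te, pred, zero_division=0):.2f}")
```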

In summary, in the initial setup, President Trump's model supposedly prevented all terrorist attacks.  In this respect it was extremely effective at achieving the desired outcome: no terrorist attacks.  But it came at the cost of preventing many non-terrorists from immigrating to the US.  That is, it prioritized the right to life of innocent civilians over the right to liberty of prospective immigrants, assuming a rights conflict.  But is there such a conflict?  Our updated model suggests that, in theory, there isn't.  If one has enough information, one can protect the right to life of civilians AND the right to liberty of immigrants by correctly identifying both terrorists and non-terrorists.


A Philosophical Reflection

But notice what we had to do.  We had to make use of lots of personal information collected by the government on individuals.  Is there a limit to the amount and kinds of information that the government or anyone else can collect?  Are there restrictions on how the collected information can be used?
 
I would bet that most people intuitively want to place limits on both the amount and kinds of information being collected on them, and the reasons they would give for that intuition are often moral in nature.  For example, most people are fine with a certain amount of information being collected on them.  But as the amount of that information grows, they become less and less comfortable.  They would likely express fears that the information will be used in inappropriate ways to target them, manipulate them, or otherwise influence them in ways that they would not want.  That is, the information could be used to harm them.

Not only is the amount of information important; the kind of information matters greatly as well.  Some kinds of information we aren't concerned about being publicly known (e.g., gender).  But there are other kinds of information that we would be hesitant to share (e.g., medical history, financial history), not merely because of how they could be used, but because of their personal and private nature.  We can certainly think of kinds of information that we wouldn't want anyone to know, especially private companies or governmental organizations (e.g., sexual history and preferences).  In short, the collection of some kinds of information about us would be a violation of our privacy, our personal integrity, or, in the extreme, our human dignity.  These are moral concerns.

Furthermore, even if the amount and kind of data were permissible, the use of such data may not be.  In the model above, we had to make use of extremely sensitive information like race and religion.  While I may not object to my race and religion being known and used in some matters (e.g., demographic statistics), I would certainly have concerns with them being used in ways that harm myself or others, or that treat myself or others unjustly.  Thus, there are morally proper and improper uses of data.

Let's step back for a moment.  Why use data in the first place?  In short, to better achieve our aims by increasing our knowledge of the subject at hand and thereby improving our decision making.  On the governmental side, the use of data is intended to help in governing, whether to improve services, increase safety, or support other legitimate government interests.  For companies, data helps a company meet competition, increase profits, and achieve other business aims.

In contrast, an individual typically does not use data in this way but is the source of the data that governments and companies use.  An individual consents to disclosing such information in exchange for goods and services.  That is, there is a sort of marketplace transaction that occurs in which an individual "pays" for goods and services with his or her information.  Thus, we can think about that information as being part of an individual's personal property.

We do this in numerous ways every day, usually without thinking about it.  We sign up for Facebook, giving our names and other demographic information.  Through the use of its services, we also provide information on who our friends are, what we are interested in, where we live, and what we believe, knowing that online companies make money off of such information through targeted advertising.  We submit to background checks to receive employment, to travel in the faster TSA PreCheck lines, or to obtain passports to travel abroad.  We sign up for customer discount cards at grocery stores to save money, while these stores use our purchase information to improve pricing, marketing, and ultimately, sales.  In short, there are legitimate and mutually beneficial arrangements that exchange data for goods and services.  Typically, as long as the source of the data has consented to the exchange, there isn't a moral issue.

However, this is not always true.  Even supposing we have consented to the collection and use of certain kinds of data, there could still be a moral objection.  John Stuart Mill famously argued that we do not have the liberty to sell ourselves into slavery, for we would be using our freedom to make ourselves un-free (a sort of self-contradiction).  Similarly, I would argue that there are likely certain amounts, kinds, and uses of information that, even if we consented to their collection and use, would still be immoral because they would violate our moral worth and dignity; hence, even our consent could not morally justify the action.  This is in spite of any other benefits, goods, or services that would result for us (consider similar arguments against prostitution).

Thus, it seems that in order to avoid these moral pitfalls in collecting and using various amounts and kinds of data, one must (at least):
  • limit the amount of data gathered to morally acceptable limits
  • limit the kind of data gathered to morally acceptable kinds
  • use the data for morally acceptable uses
  • obtain consent from individuals about whom the data is gathered for the collection and use of such data
  • provide just compensation for the data in the form of goods and services to the source of the data
What these general limits are I cannot say, nor is it my point here to do so; determining them would require much greater analysis than I have room for here.  My point is merely that it is reasonable to believe such limits exist.

A Deontological Critique

But perhaps we can say something useful about the specific scenario outlined above.  Does that scenario satisfy these criteria?  Let's explore.

The amount of data does not seem to be the issue here.  For the model above to work, no great amount of data regarding any single individual is needed, although a large number of records is (a single record for each person in the representative sample).

The kind of data being collected is a bit more concerning.  While country of origin isn't all that personal, religion is.  In some countries, that information being made public can mean the difference between life and death, or between freedom and persecution.  While this information may not be as sensitive as some other kinds, it is still extremely personal.

The use of the data seems to be the primary concern.  Religion (and Country of Origin as a sort of proxy for religion) is being used to discriminate amongst immigrants, and this is cause for great concern.  Freedom of religion is protected as a legal right by the First Amendment to the US Constitution.  Now such a right might not legally apply to immigrants (as they are not citizens of the United States), but it may morally apply.  If we believe that, morally speaking, all human beings have some right to freedom of religion, belief, or conscience, and this moral right is being restricted in some way without a direct conflict with a higher right, then that is cause for concern.

In this case, the government is treating certain country of origin/religion combinations as nearly guilty, even in the absence of any actual terrorist or criminal activity.  This reminds me of the movie Minority Report, in which individuals are arrested before they commit any crime, based on the prediction/foresight that they would commit the crime if not prevented.  Even though the program drives crime down to virtually zero, we are left with the impression that such a situation is unjust, for each supposed criminal has not yet committed any crime to be punished for.  A similar contrast comes from just war theory's treatment of pre-emptive strikes.  In a just war, it is presumed that a wrong has been committed that needs correcting, whereas in a pre-emptive strike, no wrong has yet occurred even though one is perceived as imminent.  As many people note, while posed as a defensive measure against likely aggression, a pre-emptive strike actually flips the roles of aggressor and defender.

In our scenario, because a person may be a terrorist, he or she is treated as one, even though no actual terrorism has been committed by that person.  Join this with the facts that the sole reason for believing he or she is a terrorist seems to be that he or she is Muslim and/or from a predominantly Muslim country, that a person's religion is extremely personal, and that freedom of religion is a moral right, and this use of data clearly seems to be unjustified and immoral, even if it would bring about great good (i.e., the prevention of terrorist activity).

The only escape from this argument that I can see is to argue that there is a direct rights conflict, and that the right to life of US citizens takes precedence over the moral right to liberty of immigrants in their immigration and religion.  Is there a direct rights conflict?  I don't think so.  One would have to say that there is no effective way to prevent or fight terrorism other than to ban all immigrants from certain countries on that basis alone, which does not seem reasonable.

What is more plausible is that there is no effective way to fight terrorism without taking religion and country of origin into consideration as part of the data used to identify terrorists.  But even that would hardly justify banning everyone from a certain country of origin or of a certain religion since, again, most Muslims from these countries are not terrorists.  Still, it may justify using these fields as part of the overall data used to identify terrorists.

However, as mentioned previously, other kinds of information are likely much more informative in predicting whether or not someone is a terrorist, and if this sort of information is not sensitive or extremely personal in nature, then it can and should be used instead of the more sensitive information.  Thus, it is reasonable to believe that there are effective ways of fighting terrorism that do not rely exclusively, or perhaps at all, on this kind of sensitive data, and if so, there is no direct rights conflict between the right to life and the right to liberty.  In short, it seems extremely unlikely that we are forced to choose between upholding the right to freedom of religion (i.e., liberty) and the right to life in this matter.  Put another way, the claim that failing to deny immigrants their freedom of religion and immigration leads directly to the denial of US citizens' right to life seems extremely implausible.

Has consent been obtained from the immigrants regarding the collection and use of such data?  It is hard to say.  Some information like country of origin is known publicly and explicitly through the immigration process.  Other information is perhaps voluntarily revealed as part of the application for immigration.  Would religion be part of this revealed information?  Perhaps.  Is it appropriate to ask about this as a condition for immigration?  Maybe, but it at least appears to be somewhat suspect given its sensitive and personal nature. 

Consent has been given as part of determining one's eligibility to immigrate to the United States.  But does this also mean consent to using the data to fight terrorism?  This is doubtful, unless eligibility means, among other things, that one is not a terrorist, which is reasonable: terrorists are not eligible to immigrate to the United States.  Still, one could argue that this is an instance in which the collection and use of such information, even though consented to, is immoral given the kind of information being used and its intended use.

Has just compensation for the data in the form of goods and services been provided?  I believe so, if the exchange is the ability to immigrate to the United States.  It seems that this can be a just and reasonable compensation for the exchange of data.


Conclusion

So what can we conclude, if anything?  Much of what I have said above is surely debatable by reasonable and well-informed people.  However, I believe the deontological concerns raised above suggest that even if preventing immigration from the several predominantly Muslim countries stopped all terrorist activity that would have occurred otherwise, the practice itself would still be immoral given the highly sensitive nature of the data, how it is to be used, and the likelihood that other kinds of data, not sensitive yet still effective in fighting terrorism, could be used instead.  The consequentialist good of preventing terrorist activity cannot justify means that are, deontologically speaking, immoral: collecting and using sensitive data like religion in this way.  In the absence of a direct rights conflict with the right to life, other effective and morally permissible ways of fighting and preventing terrorism must be found that do not infringe on the moral rights of immigrants in their liberty to immigrate and in their freedom of religion.

While I have focused on a specific example in the above discussion, a general point can be made that ties back to data science.  The above shows that there may be cases, perhaps often, in which a predictive model must rely on data that, from a collection, kind, or use perspective, is immoral, in order to achieve a desired level of effectiveness (e.g., accuracy, lift, precision, or some other measure).  That is, data scientists will on occasion have to choose between building consequentially effective models and deontologically moral, though perhaps less effective, models.  While such a choice can perhaps be avoided when informative but not morally suspect amounts, kinds, and uses of data exist, it seems that this won't always be the case.
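One way to make that choice visible, sketched here on purely synthetic data with a hypothetical "sensitive" column standing in for a field like religion, is to train the same model with and without the sensitive field and measure what effectiveness the deontologically motivated exclusion costs:

```python
# A hedged sketch of the effectiveness/ethics tradeoff: train the same model
# with and without a hypothetical sensitive field and compare accuracy.
# The data and the "sensitive" column are entirely synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=8, n_informative=8,
                           n_redundant=0, random_state=1)
sensitive_col = 0  # pretend column 0 is the sensitive field (e.g., religion)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

with_all = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
without = GradientBoostingClassifier(random_state=1).fit(
    np.delete(X_tr, sensitive_col, axis=1), y_tr)

print("with sensitive field:   ", with_all.score(X_te, y_te))
print("without sensitive field:",
      without.score(np.delete(X_te, sensitive_col, axis=1), y_te))
```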

Thus, it is important for data scientists to think ethically when doing their work.  A model's predictive capabilities cannot be the sole criterion for judging whether it can be used or implemented.  We must also reflect on the moral implications.  Is the data ethical in amount and kind?  Is it being used ethically?  Have the sources of data consented to its collection and use in this way?  Have they been justly compensated?  If the answer to any of these is no, are there ways to correct these issues or to use other kinds of data to achieve the same end?  If not, we may be forced to choose between being ethical and being effective.