Introduction
In a previous article, I discussed philosophical views on the nature of scientific theories and applied those discussions to data science models. I concluded that data science models, the terms they invoke, and the relationships they postulate ought to be considered to correspond to reality in some way. That is, a model's terms do in fact represent something real in the world (although a term may be an abbreviation, summary, or approximation of potentially many real entities). Similarly, a model's postulated relationship does represent something real in the world (e.g., a causal relationship among the terms in the model, or among hidden terms that make up the terms in the model). While such correspondence may only be approximate and fall far short of perfect predictive accuracy, the model is nevertheless not merely useful. It does approximate the truth, or attach to reality, in some albeit imperfect way.
Whether or not you agree, let's move on to another question in the philosophy of science that does not necessarily depend on how you answered the realist/anti-realist debate: what makes a good scientific model? How does this apply to data science models? Let's explore some ideas and then summarize at the end.
The Problem of Induction
Induction is the formation of generalizations or laws on the basis of past experience. We believe that future occurrences will behave like past occurrences, and so on the basis of past occurrences, we can predict future occurrences. For example, based on past experiences, we believe that we know (and have mathematically formulated a law) that when billiard ball A hits billiard ball B in a certain way with a certain force in conditions X, Y, Z, etc., then ball A will go in this direction at this speed and ball B will go in that direction at that speed.
However, in most cases we have no guarantee that the future will be like the past, as there are not typically necessary relationships between the objects we are interested in. It is conceivable, because it is not ruled out by logical necessity, that ball A will spontaneously combust, or turn into a carrot, when it hits ball B. Such a thing has never occurred before, but that doesn't mean it cannot happen. Such thoughts have caused some people (most famously David Hume) to be skeptical about our ability to acquire knowledge through induction.
And yet, this is precisely what we do in the sciences. Even in the absence of logical necessity, we believe that we know what will happen to ball A and ball B in these circumstances, and we can reliably predict what does in fact happen with a very small margin of error. We even go so far as to form a law, a matter of physical necessity, to explain this relationship.
But what do we do in the face of competing "laws" that both explain the data we have? Which theory do we go with and use for future research and development of theory? This is the problem of induction. How can we justify inductive inferences? That is, how can we make universal or natural law claims based on experience, when so many alternative claims could be postulated?
Falsifiable
Enter Karl Popper.
His goal is to answer the problem of induction and to distinguish true scientific theories from pseudo-scientific theories. He observes that it is really easy to formulate a theory that explains the known data, since it is done so using that data (hindsight is 20-20). While this theory may be correct, one can think of many alternative theories that also explain the data. How can one tell which theory to accept?
Popper answers that each theory must make so-called risky predictions, that is, predictions which one should expect to be false unless the theory is right. A theory that is not refutable is merely pseudo-scientific. Once we have excluded the pseudo-scientific theories and we have competing scientific theories, we can test them on the basis of what each predicts, focusing in particular on where they would disagree in a prediction. That is, each theory must propose hypotheses that are then empirically tested after the theory has been formulated.
Conclusions are deduced from the theory, and these are then compared against each other and against observation to make sure that the theory is internally consistent, that it is externally consistent with other unfalsified theories, and that when it makes a prediction, that prediction is correct. When a theory fails to predict accurately or is discovered to be inconsistent, it is falsified. If it is not inconsistent and does accurately predict, it is acceptable for use (although it may be falsified in the future). In this, Popper proposes a deductive-style method of testing. We deduce in a manner similar to this: if theory A is true, then X must occur. X did not occur. Therefore, A is falsified.
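The deductive step at the end of that paragraph is the classical inference known as modus tollens. As a minimal sketch, it can be written as a one-line Lean proof; the names `A`, `X`, `prediction`, and `observed` are illustrative labels chosen here, not Popper's own notation.

```lean
-- If theory A entails observation X, and X is observed not to occur,
-- then A is false (modus tollens).
theorem falsified_by_failed_prediction (A X : Prop)
    (prediction : A → X) (observed : ¬X) : ¬A :=
  fun hA => observed (prediction hA)
```

Note that the inference only runs in one direction: a successful prediction does not prove A, which is why an unfalsified theory remains open to future falsification.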
While this is all well and good, we still have a problem: we can have two theories that are both unfalsified and that make different predictions. Which should we use until those predictions can be tested? To answer, let's look at some other virtues that make a model good.
Elegance and Parsimony
A simpler theory is to be preferred over a more complex theory, all else being equal. Simplicity can refer both to syntactic simplicity (the number and complexity of hypotheses in a theory; it is elegant) and to ontological simplicity (the number and kinds of entities postulated by the theory; it is parsimonious) (Stanford Encyclopedia). Most famously, Occam's razor asserts that "entities must not be multiplied beyond necessity."
So why should we prefer more elegant and parsimonious theories? That is, when faced with a choice between two theories that both explain the data equally well, why choose one over the other on the grounds that one is more simple? To answer, let us consider the field of epistemology, that is, the study of knowledge. Knowledge is said to consist in having a justified and true belief. When faced with competing theories, we are asking ourselves which theory we ought to believe to be true, so our focus is on the justification for each theory. Now we have already said that each theory is consistent with the data, so what other grounds do we have for believing one theory to more likely be true than another? Which is more justified?
The simpler theory is more likely to be true as a matter of probability. Each entity in a theory has some probability of existing or of having a certain relationship with the other entities. So the more we multiply entities and relationships, the more probabilities we multiply together, and since each of these is less than 1 (and assuming the claims are independent of one another), the overall probability drops. For example, compare a theory that postulates 2 entities with one that postulates 3. If each entity has a probability of 0.75 of existing or having a certain relationship, then the former theory has a probability of (0.75)^2 ≈ 0.56 of being true, versus (0.75)^3 ≈ 0.42 for the latter. Probabilistically speaking, you ought to prefer the former theory because it is more likely to be true, and since the theories are otherwise indistinguishable, you have no other reason to prefer the latter theory.
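The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name `joint_probability` is made up for illustration, and it assumes that every claim has the same probability and that the claims are independent, which real theories rarely satisfy exactly.

```python
def joint_probability(p_each: float, n_claims: int) -> float:
    """Probability that n_claims independent claims all hold,
    when each one holds with probability p_each (an assumption)."""
    return p_each ** n_claims


two_entity_theory = joint_probability(0.75, 2)
three_entity_theory = joint_probability(0.75, 3)

print(f"{two_entity_theory:.4f}")    # 0.5625
print(f"{three_entity_theory:.4f}")  # 0.4219
```

Each additional postulated entity multiplies in another factor below 1, so under these assumptions the joint probability can only shrink as a theory grows.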
Or, returning to epistemology and the notion of justification, you have no reason for choosing a more complex theory over a simpler theory when both explain the data equally well. Suppose, for example, that you return home and find that your house has been robbed. What would you conclude? You know that at least one person must have robbed your house. But are you justified in believing that two people robbed your house? What about an alien from outer space that came and robbed your house? If you have no reason to believe that more than one person robbed your house (or that an alien robbed your house), then it seems you are not justified in believing so. Instead, you must hold the theory that only a single robber broke into your house. This holds even if, unknown to you, two robbers really did break in. That is, you must hold the simplest theory that explains the data to be true in order for that belief to be justified.
Granted, judgements about which theories are simpler, more elegant, and more parsimonious can be subjective to a degree. We may disagree in certain cases. However, we all have some intuitive understanding of what we are talking about and can agree in many cases that one theory is simpler than another.
Predictive and "Accurate"
These last three virtues are mentioned in the discussion on falsifiability, but deserve more attention in their own right. The first is that a model must be predictive. This is related to being falsifiable, in that a falsifiable theory makes predictions that can be proven to be false. But we are interested in theories that not only make predictions, but that make accurate predictions. In Popper's terms, we want theories that are strongly falsifiable and have failed to be falsified. These are our best theories and we have made lots of relatively accurate predictions based on conclusions derived from their claims. Consequently, they are extremely useful in advancing our understanding of the world and our interaction with it, according to our aims and purposes.
Coherence
To avoid being falsified, a theory must be internally and externally consistent. We can think about this in terms of coherence. First, the theory must be internally coherent: no claim that the theory makes may contradict any other claim of the theory. Such contradictions can be logical or, less strongly, physical. Even better is the case when the claims support each other (without being simply alternative ways of saying the same thing). Second, the theory must be externally coherent: it must not contradict any of the best scientific theories (unless it is challenging the existing paradigm).
Informative and Explanatory
While there are perhaps other virtues that could be considered, let us consider a final one here. We do not want theories that are merely predictive and accurate. We want to understand why. Thus, we expect a good scientific theory to be informative, to explain why things are the way they are in the world. It will postulate the causal mechanisms that explain why something happens the way the theory accurately predicts. It will provide direction for new avenues of research in light of those causal explanations. In short, we do NOT want a black box, no matter how accurate that black box may be.
Data Science and Model Virtues
So how can the above be applied to data science models?
Falsifiable
A data science model must be falsifiable. It must make predictions (i.e., hypotheses) that are capable of being false and that are tested accordingly. This is why separating one's data into a training set and a test set (and a validation set) is so important: it keeps one's model falsifiable. When one builds a model on all of the data, one can have an extremely accurate model when only looking at the data at hand. However, one is in danger of overfitting the model. One is modeling aberrations, errors, outliers, or biases in the sample data, and consequently, the model will not generalize to future data. It has NOT captured the real relationships underlying the data. Using a held-out test set can keep your model honest and make sure that your model will generalize to data that it has not seen before.
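To make the danger concrete, here is a minimal sketch in plain Python with synthetic data. Everything in it is an assumption made for illustration: the data follow a noisy straight line, `fit_line` is ordinary least squares, and `fit_memorizer` is a deliberately overfit model that simply repeats the outcome of the nearest training point.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Overfit on purpose: answer with the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption): a linear signal plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 200 points; the models never see them while fitting.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

linear = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

linear_train, linear_test = mse(linear, train_x, train_y), mse(linear, test_x, test_y)
memo_train, memo_test = mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)

print(f"linear:    train={linear_train:.2f}  test={linear_test:.2f}")
print(f"memorizer: train={memo_train:.2f}  test={memo_test:.2f}")
```

Judged on the training data alone, the memorizer looks perfect, because it reproduces the noise it was given. Only the held-out points expose it: its risky predictions about unseen data fail more badly than those of the simpler line.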
Furthermore, doing so keeps you from simply refitting the model with each new addition of data. Suppose you received data on a daily basis and retrained the model each day, and the model changed significantly each time. How confident would you be that your model was going to predict well? If it predicts one thing today and something different tomorrow, then your model is not stable and it is not going to make accurate predictions. It is no longer useful. It is as though your model is changing its mind every day, changing with the wind, and never subjected to critical scrutiny because it is always explaining the latest data without being held accountable for the inaccurate predictions it is making. This would be a pseudo-scientific model.
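One rough way to see whether a model is "changing its mind" is to refit it after each batch and track how far its parameters move. The sketch below does this for a single regression slope; the helper names and both toy data sets are invented for illustration, and what counts as a "significant" change is a judgment call that depends on the problem.

```python
def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def daily_slopes(batches):
    """Refit on all data seen so far after each daily batch."""
    seen_x, seen_y, slopes = [], [], []
    for batch_x, batch_y in batches:
        seen_x.extend(batch_x)
        seen_y.extend(batch_y)
        slopes.append(fit_slope(seen_x, seen_y))
    return slopes


def max_daily_change(slopes):
    """Largest day-to-day movement in the fitted slope."""
    return max(abs(later - earlier) for earlier, later in zip(slopes, slopes[1:]))


# Toy batches (assumptions): one series keeps the same relationship each day,
# the other reverses direction on the second day.
stable = [
    ([1, 2, 3], [2.1, 3.9, 6.0]),
    ([4, 5, 6], [8.1, 9.9, 12.0]),
    ([7, 8, 9], [14.1, 15.9, 18.0]),
]
unstable = [
    ([1, 2, 3], [2, 4, 6]),
    ([4, 5, 6], [-8, -10, -12]),
    ([7, 8, 9], [14, 16, 18]),
]

stable_change = max_daily_change(daily_slopes(stable))
unstable_change = max_daily_change(daily_slopes(unstable))
print(stable_change < unstable_change)  # True
```

A large day-to-day swing is a signal to stop and test the model against held-out data before trusting the next prediction, rather than letting it quietly absorb each new batch.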
Elegance and Parsimony
A data science model must be as simple as possible, or as simple as one's purposes allow. Why? Again, it is more likely to be "true" in the sense that one is more likely to have captured the actual relationships among the independent variables and their relationship to the dependent variable. But there is a challenge here, because simpler data science models tend to be less accurate or predictive, and this can be due to excluding variables that are predictive, even to a small degree. So we don't want a model to be too simple, and yet we don't want it to be too complex either, given the risk of overfitting. We want a model that is as simple as possible without sacrificing accuracy, and one that generalizes well when tested.
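One way to put this trade-off into practice is to pick the simplest candidate whose validation error is close enough to the best one. The sketch below assumes the candidates have already been fitted and scored elsewhere; the `Candidate` records, their error values, and the 5% tolerance are all hypothetical numbers chosen for illustration, not a recommended standard.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    n_features: int
    validation_error: float


def pick_parsimonious(candidates, tolerance=0.05):
    """Return the simplest candidate whose validation error is within
    `tolerance` (relative) of the best validation error."""
    best = min(c.validation_error for c in candidates)
    good_enough = [
        c for c in candidates if c.validation_error <= best * (1 + tolerance)
    ]
    # Fewest features wins; ties are broken by lower error.
    return min(good_enough, key=lambda c: (c.n_features, c.validation_error))


# Hypothetical validation results for four models of increasing complexity.
candidates = [
    Candidate("2 features", 2, 0.310),
    Candidate("5 features", 5, 0.204),
    Candidate("12 features", 12, 0.200),
    Candidate("40 features", 40, 0.236),
]

chosen = pick_parsimonious(candidates)
print(chosen.name)  # 5 features
```

Here the 12-feature model has the lowest error, but the 5-feature model is nearly as accurate with far fewer terms, so it is preferred. The 2-feature model is rejected for being too simple, and the 40-feature model for being both more complex and less accurate.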
Predictive and "Accurate"
A data science model must be predictive and accurate. This is the whole point! We want to accurately predict unknown values. If a model doesn't do this, it doesn't matter if it is elegant or falsifiable. It isn't true. It does not accurately model reality. Your model must generalize to new data.
Coherence
A data science model must be coherent. I suppose one could have a model that contains a variable that is nearly the opposite of a different variable in the model, and the model could use them both. While possible, I am not sure that both variables would survive even minimal feature selection. Nevertheless, if your data science model is incoherent in some way, correct it, or look into why your model is paradoxical in this way.
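A quick check for this kind of incoherence is to look for pairs of variables that are almost perfectly negatively correlated. The following is a minimal sketch in plain Python; the feature names, their values, and the -0.95 threshold are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def near_opposite_pairs(features, threshold=-0.95):
    """List feature pairs whose correlation is at or below the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if r <= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical features: the first two always sum to 100, so each is
# the mirror image of the other.
features = {
    "pct_discounted": [10, 25, 40, 55, 70],
    "pct_full_price": [90, 75, 60, 45, 30],
    "store_age_years": [3, 12, 7, 1, 9],
}

flagged = near_opposite_pairs(features)
print([(a, b) for a, b, _ in flagged])  # [('pct_discounted', 'pct_full_price')]
```

A flagged pair is a prompt to investigate rather than a verdict: the two variables may be redundant encodings of the same quantity, in which case one of them can usually be dropped.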