Friday, December 15, 2017

Data Science and Philosophy of Science: Theory and Reality

Introduction

A major branch of philosophy called metaphysics may be characterized as the study of "being" or the nature of reality.  A sub-branch of metaphysics called ontology focuses on what exists and categories of existence.  The nature of causation is also considered within the realm of metaphysics.  Related to (and perhaps a further sub-branch of) these areas of philosophy is the philosophy of science, which studies the disciplines of science to understand their processes, assumptions, and commitments from a philosophical standpoint.

How might metaphysics and its related sub-branches/topics apply to the field of data science?  While there are several possibilities, I will focus on a debate in philosophy of science regarding the ontological status of scientific theories and the objects that are part of these theories, before applying that debate to data science.

Why is Science Successful? Realism vs. Instrumentalism vs. Constructive Empiricism

Why is science successful?  What needs explaining is the relationship between our scientific theories and objective reality.  For example, when we say that the Earth revolves around the sun, what does that mean?  Does that mean that we are claiming that the sun, the Earth, and the various gravitational forces actually exist in a mind independent way?  Or do we just mean that these concepts or ways of thinking are useful, practical, and help us achieve the ends we have, but that we shouldn't place any weight on the existence of such objects? 

This is the debate between scientific realism and instrumentalism (and constructive empiricism).  While the answer seems easy when discussing objects that we can verify with our senses, it is much more difficult when it comes to so-called unobservable entities.  Does dark matter or anti-matter exist?  How do we know?  What about the theory of quantum mechanics?  Is it a true depiction of physical reality, or is it just a useful way of thinking that makes the math come out properly?

Let's explore what each view holds and the arguments for and against it.  For you hard-core philosophers of science, please forgive me if I am not exactly correct in my discussion below :)

Realism

Scientific realism is the view that "science aims to give us, in its theories, a literally true story of what the world is like; and acceptance of a scientific theory involves the belief that it is true" (Bas van Fraassen in The Scientific Image (1980)).  The intuition behind scientific realism is that we have made significant progress using scientific theories, and as science progresses, our theories become more reliable and lead to more accurate predictions.  The best way to account for this is that our scientific theories are true.  Hence, we ought to consider the objects of our theories to be real.

So even if we have to postulate unobservable entities, we should believe these to be real and existing, even if we cannot see them, because what we can see and observe implies their existence.  Hence, unobservable entities are "discovered" in a sense, and not simply created.  The unobservable entities are considered to be mind independent (and theory independent).  Granted, science is fallible and theories are approximate, but we should think about scientific theories as being approximately true.

Instrumentalism

But what about the objects of scientific theories that are no longer accepted?  What about ether or ectoplasm?  These are things that were postulated to scientifically explain phenomena, and they may have worked well, but they have been discovered to not exist, and more reliable theories with different postulated objects have replaced them.  How can we be certain that unobservable entities in our current scientific theories will not suffer the same fate?  What about various geocentric models for the solar system that, although considered incorrect now, were still highly accurate in predicting phenomena?  How can we be certain that our current theories will survive the test of time?

This is the intuition behind instrumentalism.  Scientific theories "are just conceptual tools for classifying, systematizing, and predicting observational statements" (Duhem in The Aim and Structure of Physical Theory (1954)).  This is pragmatism as applied to scientific theory.  Our scientific theories help us predict and systematize empirical observations more reliably, and they certainly can explain things, but we shouldn't take the things they talk about to be real or true, nor are we entitled to expect them to extend to new issues.  In fact, the scientific statements we make cannot be considered to be true or false: they are just useful, and some are more useful than others.

Constructive Empiricism

Constructive empiricism may be thought of as a middle ground between realism and instrumentalism.  The constructive empiricist says that "science aims to give us theories which are empirically adequate; and acceptance of a theory involves as belief only that it is empirically adequate"  (Bas van Fraassen in The Scientific Image (1980)).  What does this mean?  In short, in contrast to instrumentalism, we can say that scientific statements can be true or false.  However, this does not commit us to believing in the unobservable entities.  In fact, one's acceptance of a theory is more of a commitment to continue to do research within that framework and to use the framework for new issues.

Arguments For and Against

All three views agree that a theory must be predictive, informative, simple, and explanatory to be good (I'll discuss why in a follow up post).  However, they disagree on what this implies ontologically speaking.  So why believe one view over another?

In support of realism, the success of science would seem miraculous if the objects in our theories were not real, that is, if our theories bore no relationship or correspondence to reality.  It seems that the terms in our theories must refer to something real, although perhaps our characterization of those things is only approximately correct.  Furthermore, the line between observable and unobservable is always shifting, and what was once unobservable has become observable (e.g., bacteria, protons).  So even if something is unobservable now, it doesn't mean that it is in principle unobservable or will never become observable.

But the anti-realist will respond that two theories can be empirically equivalent and yet postulate different unobservables.  Both are empirically adequate, but they can't both be correct.  Perhaps we should just say that neither is "true" though both are useful.

While I obviously won't resolve this debate, here are some quick thoughts.  First, it seems we should take some unobservables to be real, because past unobservables have either become observable or been shown not to exist.  Second, it seems scientific theories must be capable of being true or false to avoid pseudoscientific claims that cannot be disproven (I'll table this for a follow up post), and when we speak about scientific theories, we are making claims about how the world really is.  Third, there may be unobservable entities that nevertheless exist.  Just because we cannot detect something does not mean it does not exist (to insist otherwise is to fall back into logical positivism).  This may make a theory that contains such entities difficult or perhaps physically impossible to falsify, but so long as it could in principle be falsified, perhaps this is okay.  So call me a cautious realist.

So what does this have to do with data science?

Why is Data Science Successful?

"All models are wrong, but some are useful."  So says George Box, a famous statistician, and this is something always quoted in any introduction to data science.  Which view does this match up with?  This seems to be an anti-realist position.  Is he correct?

Let's compare scientific models to the typical data science model from the anti-realist perspective.  First, the objects in scientific models are typically physical things (objects, forces), whereas the variables in a typical data science model are mostly conceptual.  We try to predict things like prices or movie ratings using variables like age or genre.  These are not objects in the world.  Instead, they are usually ways of expressing a value that people hold.

Second, scientific models always assume a causal relationship of some kind, whereas typical data science models don't postulate any sort of causal relationship, only correlation (and, as we all know, correlation does not equal causation).  How would the age of a person cause a certain movie rating, anyway?
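
To make the correlation-without-causation point concrete, here is a minimal simulated sketch (my own illustration; the variable names and numbers are made up, not from any real dataset): a hidden confounder drives both a feature and the target, so the feature predicts the target well even though it doesn't cause it.

# Hypothetical simulation: a hidden confounder z drives both x and y,
# so x predicts y well even though x does not cause y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)            # unobserved confounder
x = 2 * z + rng.normal(size=n)    # x is driven by z
y = 3 * z + rng.normal(size=n)    # y is also driven by z, not by x

print("corr(x, y):", round(np.corrcoef(x, y)[0, 1], 2))  # strongly correlated

# "Intervening" on x (shuffling it while z stays fixed) destroys the
# association, showing the x-y relationship was predictive, not causal.
x_intervened = rng.permutation(x)
print("corr(x_intervened, y):", round(np.corrcoef(x_intervened, y)[0, 1], 2))  # roughly 0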

Third, the relationship between the objects described in a scientific model is supposed to be unchanging, a law, whereas the relationships in typical data science models always seem to be changing and also depend on which variables are being considered.  For example, the coefficient for a certain variable may completely change depending on which other variables are included in a model.
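
Here is a quick sketch of that last point (again a made-up simulation, with hypothetical coefficients): the coefficient on one variable flips sign depending on whether a correlated variable is also included in the regression.

# Hypothetical simulation: x1's coefficient changes completely depending on
# whether the correlated variable x2 is included in the model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 10_000
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)            # x1 and x2 are highly correlated
y = -1.0 * x1 + 3.0 * x2 + rng.normal(size=n) # "true" coefficient on x1 is -1

# Model A: x1 alone. Its coefficient absorbs x2's effect and comes out positive.
coef_a = LinearRegression().fit(x1.reshape(-1, 1), y).coef_[0]

# Model B: x1 and x2 together. x1's coefficient is now close to its true -1.
X = np.column_stack([x1, x2])
coef_b = LinearRegression().fit(X, y).coef_[0]

print(f"x1 coefficient, x1 only: {coef_a:+.2f}")
print(f"x1 coefficient, with x2: {coef_b:+.2f}")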

I could go on, but how might the realist respond to these objections?  First, one might say that our variables are shorthand or abbreviations for physical things.  For example, a movie rating could be a proxy for something physical, like the amount of dopamine triggered in a person's brain when viewing that movie, and a genre could be shorthand for the various kinds of images and sounds depicted in the movie.  These are all physical things, but it is more useful and practical to speak about them in higher-level, non-physical summary terms.

Second, leaving aside debates about what causation actually is, using the prior response, we could tell a causal story between these variables.  When we say that the "increased square footage of the house caused its price to go up", we are actually saying something like "the amount of dopamine triggered in the brain by the recognition of increased square footage of the house triggered another neuron to fire that led to the individual valuing the house more, and being willing to pay more money for it."  This isn't exactly perfect, but you get the point: we can translate the non-physical entities and forces into physical entities and forces that underlie, compose, and give rise to them. 

Third, because these abbreviations and summaries are imprecise and don't account for every physical and causal entity, it shouldn't surprise us that our data science models are not perfect, or that the introduction of new variables (i.e., physical entities and forces) should change our understanding of the relationships.  However, that does not imply that the relationships are not legitimate or approximately true.  In fact, we have this intuition that if we could account for every physical entity and force in our data science model, then it would in fact be 100% accurate.
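
That intuition can be illustrated with a toy example (simulated and idealized; real data science problems never hand you every relevant variable): the target is fully determined by three variables, so omitting variables makes the model only approximately right, while including them all makes it essentially exact.

# Hypothetical simulation: y is completely determined by three variables,
# so accuracy climbs toward perfect as the model accounts for all of them.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2]  # no noise: a "complete" model exists

for k in (1, 2, 3):
    r2 = LinearRegression().fit(X[:, :k], y).score(X[:, :k], y)
    print(f"variables used: {k}  R^2 = {r2:.3f}")
# R^2 approaches 1.0 as every relevant variable is included.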

Conclusion

So what do I think?  I tend to have realist tendencies, so I am inclined to favor a realist interpretation of data science models.  I believe that our models must be latching on to something real if they are in fact useful.  And our concepts do attach to real entities and forces in the world, although they are summaries of very complex relationships among things and perhaps are not real in themselves (that is another debate).  So I am not surprised that our models are only approximately true (i.e., close in predicting numeric values, more often correct than not in categorizing).

And while two models may both be equal in empirical accuracy, experience suggests that either only a few common variables are doing the main predictive work (and the additional variables are mostly noise) or the differing variables are both correlated with a third variable that really explains the correlation.  With additional work and time, the two supposedly differing models should converge and begin to look more and more similar as predictive accuracy increases.  Again, this suggests that there is something real that the models are latching on to, although it may be disguised and hard to uncover in a precise manner.
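
Here is a small sketch of that convergence idea (all variable names and numbers are invented for illustration): two "different" models use different features, but both features are noisy proxies for the same underlying variable, so the models end up roughly empirically equivalent.

# Hypothetical simulation: two models with different features achieve similar
# accuracy because both features are proxies for the same latent variable.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 10_000
latent = rng.normal(size=n)                 # the real driver of the target
feat_a = latent + 0.3 * rng.normal(size=n)  # model A's feature: a noisy proxy
feat_b = latent + 0.3 * rng.normal(size=n)  # model B's feature: another noisy proxy
y = 5.0 * latent + rng.normal(size=n)

for name, feat in (("model A", feat_a), ("model B", feat_b)):
    X_train, X_test, y_train, y_test = train_test_split(
        feat.reshape(-1, 1), y, random_state=0)
    r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test R^2 = {r2:.3f}")
# Both models are roughly empirically equivalent because both latch on to the
# same latent variable, just through different measurements.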

So I would respond that, strictly speaking, Box is correct: if a model is either all true or all false, then all models are in fact false.  But if models can be approximately true, then some models are approximately truer than others, and in this sense, are modeling reality and the relationships among entities in a truer and more real (and perhaps causal) way.

So how can we make sure that the data science model we have is approximating reality?  We'll explore this in a follow up post.
