
I've been engaging in a number of forecasting efforts recently, and have rediscovered a well-known truth: that combinations of different forecasts are generally better than the individual forecasts themselves. In particular, the unweighted mean of forecasts is typically better than any of the forecasts being averaged. So far, in my own work, the only exceptions I have encountered are cases where the data are artificially generated from simple models.

I was, and remain, flabbergasted by this. Why should averaging models based on entirely inconsistent assumptions generate anything but nonsense? Why is the unweighted average of the best model and several relatively inferior models usually better than the best model on its own? Why do we seem to get most of the benefits of sophisticated ensemble methods from the unweighted mean?
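To be concrete, here is a toy version of the kind of comparison I mean (just a sketch: the data-generating function is made up for illustration, and numpy and scikit-learn are assumed). It fits a few standard model families to the same data and reports each model's test error alongside the error of their unweighted mean; the exact ranking varies with the seed, but it shows where the simple average lands relative to the individual models.

```python
# Toy comparison: several model families vs. their unweighted mean.
# The data-generating function below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + X[:, 2] + rng.normal(0, 1, 400)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=10),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}
preds = {name: m.fit(X_tr, y_tr).predict(X_te) for name, m in models.items()}

def rmse(p):
    return np.sqrt(np.mean((p - y_te) ** 2))

for name, p in preds.items():
    print(f"{name:6s} RMSE: {rmse(p):.3f}")
print(f"mean   RMSE: {rmse(np.mean(list(preds.values()), axis=0)):.3f}")
```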

I always thought that the modeling process was intended to find the model that most nearly approximated the underlying reality, imperfectly, of course, but still assuming that there would always be a best model given specified constraints of parsimony, data availability, and the like. To me, the fact that the unweighted mean of a more-or-less arbitrary collection of model types (ones that experience has taught us are pretty good) usually beats the best of them does not suggest that the true model is roughly the mean of the constituent models---that would be absurd.

Instead, it suggests that there is no true data-generating process that can be approximated by any standard estimating technique, however sophisticated. The data may be generated as some complex summation or composite of many, many agents or sub-processes, each of which or who embodies a unique complex of causal forces, perhaps including multiple layers of non-linear feedback. Perhaps they are influenced or entrained by common exposure to forces that you as a modeler will never see, like the boss's mood or the ionization level in the air or irrational remnants of historical institutional structures that persist and still affect decisions.

You see this in other ways too. For example, sometimes the theory is utterly unambiguous about which models are to be preferred. It is, for example, entirely clear that most macroeconomic variables modeled by VARs or VECMs should be logged or log-differenced, for multiple compelling reasons both statistical (e.g. to avoid heteroskedasticity and to linearize any trend present) and economic. Yet when you actually run such models, the opposite turns out to be true. I have no idea why.
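A level-versus-log-difference comparison of that sort can be set up along these lines (just a sketch: `df` stands for a hypothetical pandas DataFrame of strictly positive macro series, statsmodels is assumed, and the lag order, horizon, and number of forecast origins are arbitrary). It fits a VAR in levels and a VAR in log-differences, and compares rolling-origin forecast errors for the first variable, measured in levels in both cases.

```python
# Rolling-origin RMSE for a VAR in levels vs. a VAR in log-differences.
# `df` is a hypothetical DataFrame of strictly positive series.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

def rolling_rmse(df: pd.DataFrame, spec: str, h: int = 4, n_origins: int = 20) -> float:
    errors = []
    for t in range(len(df) - n_origins - h + 1, len(df) - h + 1):
        train = df.iloc[:t]
        if spec == "levels":
            fit = VAR(train).fit(2)
            fc = fit.forecast(train.values[-fit.k_ar:], steps=h)[-1, 0]
        else:  # "logdiff": forecast log-differences, cumulate, exponentiate back
            ld = np.log(train).diff().dropna()
            fit = VAR(ld).fit(2)
            dsum = fit.forecast(ld.values[-fit.k_ar:], steps=h).sum(axis=0)[0]
            fc = float(np.exp(np.log(train.iloc[-1, 0]) + dsum))
        errors.append(fc - df.iloc[t + h - 1, 0])   # error in levels
    return float(np.sqrt(np.mean(np.square(errors))))

# print(rolling_rmse(df, "levels"), rolling_rmse(df, "logdiff"))
```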

My question is this. Has anyone found a way of formalizing the belief that the processes we strive to understand have no data-generating process that we can capture in a standard mathematical model? Has anyone attempted to describe the foundations of statistics based on such a formalization -- a statistics in which all models are unavoidably misspecified? If so, does it have any known implications for hypothesis testing, and for the sort of test-and-redesign process that constitutes the normal workflow of a statistician or data scientist? Should we be multiplying models earlier in the analysis process? If so, how? Should we be choosing which models to aggregate based on some principle other than quality of fit with a complexity penalty, or a model-comparison criterion like AIC? If models are ultimately designed to be inputs to ensembles, should we prioritize models that give different predictions rather than models that give good predictions? Is there a principled way to make such trade-offs?

And if this is the norm, why isn't it in any of the six widely-used introductory statistics texts I went through when composing this post?

andrewH
  • 1
    *isn't it in any of the six widely-used introductory statistics tests* Did you mean *texts*? – kjetil b halvorsen Sep 03 '21 at 03:55
  • 2
    See my answer at https://stats.stackexchange.com/questions/383731/likelihood-free-inference-what-does-it-mean for some relevant links & ideas, especially the book by Laurie Davis. This is about systematically seeing models as only approximations for data, not as some sort of *truth*. – kjetil b halvorsen Sep 03 '21 at 04:09
  • 2
    This extends to people too: [Wisdom of the crowd](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) – Robin Gertenbach Sep 03 '21 at 11:34
  • 2
    @RobinGertenbach but not uniformly so https://despair.com/products/meetings?variant=2457301507 ;o) – Dikran Marsupial Sep 03 '21 at 14:23
  • 1
    @kjetilbhalvorsen Absolutely. good catch. I'll edit accordingly. – andrewH Sep 11 '21 at 19:12
  • I always find it fascinating that taking a set of deliberately *dumbed-down* decision trees and putting them together as a random forest outperforms a decision tree that wasn't dumbed down. So https://stackoverflow.com/q/48239242/841830 is similar to your question, I think. You'll also find every Kaggle competition winner is using an ensemble approach; and every NLP state-of-the-art performance is just about always an ensemble of the *same* model, just built with a different random seed each time. A billion parameters and you can still outperform it with some unweighted averaging! – Darren Cook Sep 13 '21 at 21:02

2 Answers


Have you heard the "all models are wrong, but some are useful" quote? It's one of the most famous quotes in statistics.

Let's use human language as an example. What you say is the result of many parallel and concurrent processes. It is influenced by the rules governing the language, your fluency in it, your educational background, the books you've read in your lifetime, cultural factors, context, whom you're talking to, the psychological and physiological factors acting on you at the moment of speaking, and many, many more things; you may also be quoting or misquoting someone who was influenced by all of these in the past, and so on. There is no single function, process, or distribution that "generated" the words that came out of your mouth.

Playing Advocatus Diaboli, now think of forecasting the weather. It is hard because the weather is influenced by so many interacting factors; weather is a chaotic system. But maybe the system as a whole can be thought of as a process that generates the weather?

It's a philosophical discussion. It's also an unnecessary one, at least from a practical point of view. We don't really need to believe that there's a distribution or process that generates our data; it's a mathematical abstraction. We wouldn't be able to talk about statistical properties of estimators such as bias and variance (to give only one example) without introducing some abstract, mathematical objects for the things that are modeled. We are using mathematical functions to approximate something, so that something also needs to be treated as a function if it is to be discussed in mathematical terms. We are not claiming that there really exists a process that "generates" the data for us; we are just using an abstract concept to talk about it.
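To make that concrete, the standard bias-variance decomposition can only be stated once you posit such an abstract object: assuming, at a fixed input $x$, that $Y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and taking expectations over training samples and noise,

$$
\mathbb{E}\big[(Y - \hat f(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}.
$$

The "true" $f$ enters here only as the abstract object that makes bias and variance definable at all, not as a claim that such a function literally generates the data.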

So yes, all models are misspecified, wrong. They are only approximations, and the "things" they approximate are themselves abstract concepts. If you want to go all the way down the rabbit hole: there is no such thing as sound, colors, wind, trees, or us. We are just particles surrounded by other particles, and we assign meanings to groups of particles that happen to stay close to each other at a particular moment. But do those things exist? Should we maybe be building particle-level models of reality? A related xkcd is below.

[xkcd 435, "Purity": fields arranged by purity, from sociology to mathematics. See https://explainxkcd.com/wiki/index.php/435:_Purity for a transcript and explanation.]

Tim

Looking at it the other way, if there were no true data generating process, how did the data get generated?

The inability of standard estimating techniques to accurately approximate the true data-generating process doesn't mean that the data generating process doesn't exist, it just means that we don't have enough data to determine the parameters of the model (or more generally the correct form of the model).

However, when we make a model, our goal is not to exactly capture the true data generating process, only to make a simplified representation or abstraction of the important features of the true data generating process (TDGP) that we can use to understand the TDGP or to make predictions/forecasts of how it will behave in some situation we have not directly observed. Our brains are very limited; we can't understand the detail of the TDGP, so we need abstractions and simplified models to maximise what we are able to understand.

Rather than say there is no TDGP, I would say there is no such thing as "randomness" (except perhaps at a quantum level, but even that might not be random either, although the Bell experiment suggests it probably is). We use the concept of "random" to explain the results of deterministic systems that we can't predict because of a lack of information. So the purpose of a statistical model is to express our limited state of knowledge regarding the deterministic system. For example, flipping a coin isn't random; whether it comes down heads or tails is just physics, depending on the properties of the coin and the forces applied to it. It only seems random because we don't have full knowledge of those properties or forces.
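A toy illustration of that last point (not real coin physics; numpy is assumed and all the numbers are made up): the outcome below is a deterministic function of the initial spin rate and launch speed, yet with only coarse knowledge of the spin rate the results look like a fair coin.

```python
# Deterministic toy "coin flip": heads/tails is fixed by the initial
# conditions; apparent randomness comes only from our ignorance of them.
import numpy as np

def coin_outcome(omega, v0, g=9.81):
    """Parity of half-turns completed before the coin lands."""
    t_flight = 2 * v0 / g                       # time in the air
    half_turns = int(omega * t_flight / np.pi)  # completed half-rotations
    return "heads" if half_turns % 2 == 0 else "tails"

# Exact initial conditions -> exact, repeatable prediction
print(coin_outcome(omega=40.0, v0=2.0))

# Coarse knowledge of the spin rate -> outcomes look like a fair coin
rng = np.random.default_rng(0)
flips = [coin_outcome(rng.uniform(20, 80), 2.0) for _ in range(10_000)]
print(flips.count("heads") / len(flips))        # roughly 0.5
```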

At the end of the day, the more data we have, in principle the more information we can extract from it (with diminishing returns), and the better our state of knowledge about the TDGP.

The reason averaging helps is that the error of the model is composed of bias and variance, cf. @Tim's answer (+1). If we don't have much data, the variance component will be high, but that variance will not be coherent for models trained on different samples, and so will partially cancel when model predictions are averaged. This is not telling you anything about the TDGP; it is telling you about the estimation of model parameters (and that you should get more data if you can).
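A small sketch of that variance-cancellation mechanism (synthetic data, numpy and scikit-learn assumed, all settings arbitrary): the same high-variance model class is fit to different bootstrap resamples of one data set, and the unweighted mean of the resulting predictions is compared with a typical single fit.

```python
# Variance cancellation under averaging: fully grown trees have high variance,
# and averaging trees fit to bootstrap resamples reduces the test error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, 150)
X_te = rng.uniform(0, 10, size=(1000, 1))
y_te_true = np.sin(X_te[:, 0])              # noise-free target for test error

preds = []
for b in range(25):                          # 25 bootstrap resamples
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds.append(tree.predict(X_te))
preds = np.array(preds)

def rmse(p):
    return np.sqrt(np.mean((p - y_te_true) ** 2))

print("mean RMSE of single trees:", np.mean([rmse(p) for p in preds]))
print("RMSE of averaged trees:   ", rmse(preds.mean(axis=0)))
```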

Dikran Marsupial
  • 1
    A quibble about wording: how is the true DGP different from the DGP? I think "true" is largely superfluous here. – Richard Hardy Sep 03 '21 at 07:53
  • @RichardHardy it probably is, although if you have a generative model then it distinguishes that data generating process from the real one. There is also the nuance about whether we mean the data generating process in its *entire* detail and the "true" sort of hints at that. The key point for me is that we are never trying to capture the full/true data generating process in *any* branch of statistics, AFAICS. – Dikran Marsupial Sep 03 '21 at 07:58
  • 1
    Makes sense. Though I would still call a generative model merely a model, not a DGP. But that may be a matter of taste. – Richard Hardy Sep 03 '21 at 08:53
  • I don’t think the answers above get at the source of my puzzlement, although the weather example comes close. Assume that there is a relationship, expressible mathematically, between the antecedent causes and the consequent effects. Our knowledge of the causes is partial and noisy, sure. The true relationship casting the shadow on the cave wall is unknown and un-(with certainty)-knowable. Fine. But that does not explain why an _unweighted_ mean of forecasts from a relatively arbitrary handful of models should _consistently_ be better than the _best_ model in the group. – andrewH Sep 11 '21 at 19:59
  • In the “wisdom of crowds” example above, I’d just assume that each person has observed their own sample of reality, in which case you expect the average to be better by the law of large numbers. But the instances I have been finding in practice involve multiple models estimated on _exactly the same data_. – andrewH Sep 11 '21 at 20:16
  • I admit that “there is no data-generating process” is a rhetorical exaggeration. Plainly, if there is data, it is generated. What I really intend this to express is that the data generating process does not consist of a mathematical relationship such that, with the addition of error terms to handle our ignorance, omitted variables, etc., it produces reality – that there is no straightforward mathematical relationship between causes and effect for our models to approximate. – andrewH Sep 11 '21 at 20:21
  • I think that this phenomenon is peculiar to complex systems interacting in complicated ways. I’d bet money that the best model of the distribution of electrical charge in a regular crystal is better than the average of several pretty good models. In forecasting, unlike physics, mathematics is not unreasonably effective. Sometimes it is not even reasonably effective. – andrewH Sep 11 '21 at 20:21
  • There is nothing particularly surprising about ensemble forecasts being more reliable, it is just the bias-variance decomposition: the variance components of the committee members are unlikely to be coherent, so they tend to cancel under averaging. The reason why it is consistently better is likely to be that the error of your forecasts is dominated by the variance component. It isn't always the case. – Dikran Marsupial Sep 11 '21 at 22:15
  • It is trivial to construct an example where the unweighted mean is not better than the best member of the ensemble. If I want to predict the result of the next roll of a six-sided die, I could get 20 of my students to write a program to simulate a die roll and run the experiment. The unweighted mean of the forecasts from those simulations is *very* unlikely to be better than the best of those 20. – Dikran Marsupial Sep 11 '21 at 22:26
  • This thought experiment also makes the point that the best of those ensemble members isn't the best because it is a more accurate model (they are *all* just guessing). The reason it is the best is because its "random" component, purely by chance, happens to be correlated with the "randomness" of reality. – Dikran Marsupial Sep 11 '21 at 22:31
  • BTW physicists commonly use "linearisations" of complex mathematical systems. When we use a linear model in statistics we know that the underlying system is vanishingly unlikely to be *exactly* linear. Part of being a good statistician, like a good physicist, is knowing when these approximations are reasonable. As I said, we are not trying to exactly capture the TDGP, we are trying to make a model for our understanding or for making predictions. – Dikran Marsupial Sep 11 '21 at 22:42
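The die-roll thought experiment in the comments above is easy to check numerically (numpy assumed; the number of "students" and repetitions are arbitrary): the unweighted mean of the simulated rolls almost never beats the in-hindsight best of the twenty guesses, even though that best guess is only best by luck.

```python
# Die-roll thought experiment: 20 simulated "student" forecasts of the next
# roll vs. their unweighted mean, judged in hindsight, over many repetitions.
import numpy as np

rng = np.random.default_rng(42)
n_reps, n_students = 10_000, 20
mean_wins = 0
for _ in range(n_reps):
    guesses = rng.integers(1, 7, n_students)     # each student's simulated roll
    actual = rng.integers(1, 7)                  # the real next roll
    best_err = np.min(np.abs(guesses - actual))  # best student, chosen in hindsight
    mean_err = abs(guesses.mean() - actual)      # unweighted-mean forecast
    mean_wins += mean_err < best_err
print(mean_wins / n_reps)   # a tiny fraction: the mean almost never beats the lucky best
```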