First of all, I agree with @HEITZ: if all we have is equal cross validation performance, then that's all we have and it does not allow further distinction. Also, one model may be just as badly underfit as the other is overfit...
As usual, this is where external (independent) knowledge about the situation at hand helps a lot, e.g. in judging what is going on. I'm thinking of, say, a discriminative classifier vs. a one-class classifier that both yield the same predictions and thus the same error/performance measure. The one-class classifier is the more complex one - but the decision one-class vs. discriminative classifier should anyway be based on the nature of the data/application. And yet, there may be situations where one concludes that one-class classification would be needed, but the available data requires a more restrictive model (with important differences in the CV-measured performance).
However, I'd like to point out that it is possible to measure some symptoms of overfitting (namely, instability of predictions when a few training cases are exchanged) by iterated/repeated cross validation, even if the chosen error measure per se does not penalize complexity.
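A minimal sketch of what I mean, in Python/scikit-learn with hypothetical data (the model and the instability summary are just placeholders, not a fixed recipe): repeat the cross validation with reshuffled splits, collect the out-of-fold prediction for each case in each repetition, and look at how often those predictions flip.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    model = DecisionTreeClassifier()          # stand-in for the "complex" model
    n_repeats, n_folds = 50, 5

    # one out-of-fold prediction per case and per repetition
    preds = np.empty((n_repeats, len(y)), dtype=int)
    for r in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=r)
        for train, test in cv.split(X, y):
            model.fit(X[train], y[train])
            preds[r, test] = model.predict(X[test])

    # instability symptom: how often does a case's prediction deviate from
    # that case's majority prediction across the repetitions?
    majority = np.round(preds.mean(axis=0))
    instability = (preds != majority).mean()
    print(f"fraction of predictions flipping across repetitions: {instability:.3f}")

If that fraction is substantially above what you'd expect from the estimated error rate alone, the model reacts strongly to exchanging a few training cases - which is exactly the instability symptom of overfitting.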
Therefore, I reserve the right not to believe that the complex model is free of overfitting unless results are presented that clearly show that possible overfitting was checked for and found to be absent, and that exclude the possibility of a lucky cross validation split being reported (particularly if the complex model has hyperparameters that were aggressively optimized).
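For the "aggressively optimized hyperparameters" case, the usual safeguard is nested cross validation: the inner loop does the tuning, the outer loop measures performance on cases that never took part in the tuning. Again only a sketch with hypothetical data and an arbitrary model/grid:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

    # inner loop: (aggressive) hyperparameter optimization
    tuned = GridSearchCV(SVC(),
                         {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)},
                         cv=inner)

    # outer loop: performance estimated on cases the tuning never saw
    scores = cross_val_score(tuned, X, y, cv=outer)
    print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

Reporting the outer-loop figure (ideally over several reshuffled outer splits) is what excludes the "lucky split" explanation.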
On the other hand, resampling validation cannot guard against drift in the underlying population - and such drift may call either for a more complex model (the human brain can correct for such drift in an amazing fashion!) or for a less complex model (one that doesn't overfit, so data drifting slightly out of the training space will not be subject to totally weird predictions).
Secondly, I'd like to argue that the usual approaches we take from numeric optimization are meant for rather different situations than what we have here. Searching for the (= one) best model may or may not be appropriate. A situation with a true global optimum may be expected when optimizing the complexity of essentially the same model (say, the ridge parameter) - a situation that may be described as selecting one member of a continuous family of models. But if the compared models span a variety of model families, I don't think the finding that several model families achieve the same performance should be too surprising at all. In fact, if I found logistic regression, LDA and a linear SVM to perform equally well, the conclusion would be "linear classification works" rather than speculation about how these models differ in their stability depending on the training cases. And still, I don't see why a non-linear model shouldn't perform just as well if sufficient training data is available.
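To illustrate (again Python/scikit-learn on hypothetical data; the particular models and data set are just examples), running the candidate families through the same cross validation scheme typically looks like this:

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                               random_state=0)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "linear SVM": SVC(kernel="linear"),
        "RBF SVM (non-linear)": SVC(kernel="rbf"),  # with enough data, often just as good
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10)
        print(f"{name:22s} accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

When those numbers come out indistinguishable (within their spread), the sensible reading is "linear decision boundaries are sufficient here", not that the cross validation failed to find the one true model.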
From a philosophical point of view, I'd say there's nothing that keeps nature from having tons of influencing factors and interactions between them. So parsimony doesn't make the model more true, it just guards against overfitting. So if (and only if) the model validation is done properly on independent cases, we don't need this safeguard, as overfitting is suitably penalized. In practice, however, cross validation frequently doesn't achieve as independent a splitting as we'd like to believe - so an additional safeguard is a very sensible precaution in practice:
In theory, there is no difference between theory and practice.
In practice, there is.
In that sense, I think that Occam's Razor is more important for us (modeling folk) than for the models: we humans are known to be notoriously bad at detecting overfitting. I'm an optimist, though, and think that detecting overfitting can be learned. :-D
Parsimony also allows us to construct predictive models that achieve reasonable predictions based on a few input variates (which are possibly easier to measure), and that are possibly easier to study, say, in terms of which parts of the input and model space are actually populated by our data. In addition, such models may be more easily correlated with (or augmented by) independent/external knowledge.