
I'm using k-fold cross-validation for model selection, but it does not appear to favor any particular model. There are several variants of the models, and two of them are nested within (i.e., are more restrictive versions of) the third. I've tried different numbers of folds (2, 5, and 10) and multiple iterations (up to 100) with random splits of the data, but this does not appear to make a difference. I'm using the mean squared error of prediction (MSEP) to compare the models, as well as the standard deviation of the squared error of prediction across iterations to get a sense of how noisy the MSEP is. So, for instance, the MSEP for models A and B may be .045 and .054, respectively, but they are within one SD of each other. This makes me think that these differences are just random.

Does anyone have a sense of how to interpret this? If a more flexible model does as well as a more parsimonious model in cross-validation, does this mean that the simpler model should be favored? Or is it possible that the cross-validation analyses are simply not diagnostic for these data? The number of observations is in the thousands, and the data are used to construct proportions within different categories.
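For concreteness, here is a minimal sketch of the repeated k-fold procedure described above (the estimator objects, the `X`/`y` arrays, and the scikit-learn-style `fit`/`predict` interface are illustrative assumptions, not details from the question):

```python
# Minimal sketch: repeated k-fold CV, returning one MSEP value per iteration.
# ASSUMPTIONS (for illustration only): data are numpy arrays X, y, and the
# models follow the scikit-learn fit()/predict() convention.
import numpy as np
from sklearn.model_selection import KFold

def repeated_cv_msep(model, X, y, k=5, n_iter=100, seed=0):
    """Return an array of MSEP values, one per random k-fold split."""
    rng = np.random.RandomState(seed)
    msep_per_iter = []
    for _ in range(n_iter):
        kf = KFold(n_splits=k, shuffle=True,
                   random_state=rng.randint(0, 2**31 - 1))
        sq_errors = []
        for train_idx, test_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            sq_errors.append((y[test_idx] - pred) ** 2)
        msep_per_iter.append(np.concatenate(sq_errors).mean())
    return np.array(msep_per_iter)

# Hypothetical comparison of two nested models A and B:
# msep_a = repeated_cv_msep(model_a, X, y, k=5)
# msep_b = repeated_cv_msep(model_b, X, y, k=5)
# print(msep_a.mean(), msep_a.std(), msep_b.mean(), msep_b.std())
```

The MSEP quoted in the question would correspond to the mean of the returned array, and the quoted SD to its standard deviation across iterations.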

  • The simpler model is (almost) always preferred. What is your sample size? Can you show us some plots? It may be, as you mentioned, that your model isn't really learning anything and this is all random. – user2974951 Jan 07 '19 at 09:51
  • Thanks for the response. What kind of graphs would help? – user233241 Jan 07 '19 at 22:48
  • Show us how the model prediction estimates change with various CV parameters, so for ex. x-axis is k = 2, 5, 10, ... and y-axis is MSEP, with averages and standard deviations. (A sketch of such a plot is given after this comment thread.) – user2974951 Jan 08 '19 at 09:06
  • The link is pasted below. The number of observations is in the thousands. Model A is nested in Models B and C, and Model B is nested in Model C. – user233241 Jan 09 '19 at 21:27
  • There is almost surely no difference between the three models; they are very similar. What is weird is that your MSEP *increases* with the number of folds, which should not happen (the opposite should be true), so you may just be introducing noise by adding more variables. – user2974951 Jan 10 '19 at 07:30
  • Thanks very much for your response. If I understand correctly, the fact that MSEP increases suggests that bias is increasing, and this shouldn't happen with increasing fold number; only the variance should increase with increasing k, but both seem to be happening for these data. Is this correct? I'm not sure I understand the explanation for this, though: the number of variables across these different fold sizes is the same, so why would this add noise? – user233241 Jan 10 '19 at 08:05
  • It wouldn't, never mind that statement. Whether bias or variance should increase or decrease with different numbers of folds is not really clear. Have a look at https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation. Overall, there does not seem to be any difference between the three models, so any one of them is as good as the others. – user2974951 Jan 10 '19 at 08:38
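Here is a sketch of the diagnostic plot suggested in the comments above, reusing the hypothetical `repeated_cv_msep()` helper from the earlier sketch (the model objects and data arrays are again assumptions for illustration):

```python
# Sketch: mean MSEP vs. number of folds k, with SD across iterations as error bars.
import matplotlib.pyplot as plt

def plot_msep_vs_k(models, X, y, ks=(2, 5, 10), n_iter=100):
    """models: dict mapping a label to a fit()/predict() estimator."""
    for name, model in models.items():
        means, sds = [], []
        for k in ks:
            msep = repeated_cv_msep(model, X, y, k=k, n_iter=n_iter)
            means.append(msep.mean())
            sds.append(msep.std())
        plt.errorbar(list(ks), means, yerr=sds, marker="o", capsize=3, label=name)
    plt.xlabel("number of folds k")
    plt.ylabel("MSEP (mean ± SD across iterations)")
    plt.legend()
    plt.show()

# Hypothetical usage:
# plot_msep_vs_k({"Model A": model_a, "Model B": model_b, "Model C": model_c}, X, y)
```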

1 Answer


If you are comparing/selecting among several models that were all set up, according to your knowledge of the application at hand, as good and sensible candidates, there is really nothing that guarantees different performance. After all, several different models can reach the Bayes error for a given application (which doesn't mean you got there; there are further situations that lead to equal or very similar predictive performance of models of different types).

However, what made me write this late answer is:

I'd suggest looking a bit more into the relationship between $MSE_{CV}$ (at least in my field, $MSE_P$ is reserved for test-set predictions) and $k$:

  • CV error really should not increase systematically with $k$, and
  • the standard deviation across iterations (as opposed to the standard deviation across folds) measures the instability of your models (or, more precisely, of their predictions).

For small $k$, fewer actual training cases are available for training each surrogate model, which may lead to models that are worse on average and/or more unstable. However, you observe the opposite.
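To make the training-set-size point concrete (the figure $n = 5000$ is hypothetical; the question only says the number of observations is in the thousands): with $n$ observations and $k$ folds, each surrogate model is trained on $n_{\text{train}} = \frac{k-1}{k} n$ cases, so $k = 2$ gives 2500 training cases per surrogate model while $k = 10$ gives 4500. If anything, the $k = 2$ surrogates should therefore be the worse ones, which is the opposite of the pattern you report.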

This could be a symptom of a programming mistake, and programming mistakes when calculating model performance can cause the whole verification/selection/optimization to go wrong.
(I'm speaking as someone who once retracted an already-submitted manuscript after finding a tiny, single-character typo that created a data leak and caused severe optimistic bias in the cross-validation results.)

– cbeleites unhappy with SX
  • Thanks. I've checked my code repeatedly and it seems to be doing what it's supposed to. The data I'm splitting are frequencies in different categories, and I'm using them to calculate proportions. Is it possible that $MSE_{CV}$ is increasing with $k$ because using very different sample sizes (and totals) to create the proportions in the test and training sets creates a larger discrepancy between them (and consequently increases the difference between the actual and predicted data)? I'm not sure if this explanation is circular; I think not, because it is specific to how these data are used. – user233241 Jan 14 '19 at 00:40
  • Following up on this, I checked directly and found that the absolute value of the difference between the proportions generated from the training and test data does increase with $k$. I checked this by taking the mean of the absolute value across hundreds of iterations. There does not appear to be any systematic direction to the difference, because the mean of the raw difference fluctuates around zero in every case. Given this, doesn't it make sense that parameters estimated from the training set with larger $k$ will underperform when predicting the test-set data? – user233241 Jan 14 '19 at 01:15
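A sketch of the check described in the last comment, i.e., the mean absolute difference between category proportions computed on the training and test parts of each split as a function of $k$ (the 1-D array of category labels, `categories`, is an assumed data layout, not a detail from the thread):

```python
# Sketch: mean absolute train-vs-test gap in category proportions, per k.
# ASSUMPTION: `categories` is a 1-D numpy array of category labels,
# one label per observation.
import numpy as np
from sklearn.model_selection import KFold

def mean_abs_proportion_gap(categories, k, n_iter=100, seed=0):
    """Average |train proportion - test proportion| over folds and iterations."""
    rng = np.random.RandomState(seed)
    levels = np.unique(categories)
    gaps = []
    for _ in range(n_iter):
        kf = KFold(n_splits=k, shuffle=True,
                   random_state=rng.randint(0, 2**31 - 1))
        for train_idx, test_idx in kf.split(categories):
            p_train = np.array([(categories[train_idx] == lev).mean() for lev in levels])
            p_test = np.array([(categories[test_idx] == lev).mean() for lev in levels])
            gaps.append(np.abs(p_train - p_test).mean())
    return np.mean(gaps)

# Hypothetical usage:
# for k in (2, 5, 10):
#     print(k, mean_abs_proportion_gap(categories, k))
```

With larger $k$ the test folds are smaller, so their proportions are noisier, which would be consistent with the gap growing with $k$ as described in the comment above.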