I see many descriptions of splitting the data into a training set, a validation set and a test set. We train our models on the training set, choose the best model on the validation set, and finally see how the chosen model performs on the test set. The best model is picked on the validation set using some statistic, say MSE. But what do we do when the MSEs of, say, two models are really close? By the law of parsimony I might want to choose the more parsimonious of the two competing models, even though its MSE is a bit higher. I propose here a model selection method:
The algorithm would be like this (a rough code sketch follows after the list):
1) Train your models on the training set
2) Resample the validation set with replacement into K bootstrap validation sets
3) Predict with each model on each of the K validation sets
4) Calculate the MSE/MSPE of every model on each of the K validation sets
5) Build the MSE/MSPE distribution of each model from the K calculations
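To make steps 2)–5) concrete, here is a minimal sketch of what I have in mind, in Python/NumPy (the function name `bootstrap_mse_distributions`, the argument names and K = `n_boot` are just placeholders of mine). Since the models are already fitted in step 1), their predictions on the validation rows can be computed once and only the rows need to be resampled:

```python
import numpy as np

def bootstrap_mse_distributions(y_val, preds_by_model, n_boot=1000, seed=0):
    """Resample the validation set with replacement and recompute each model's MSE.

    y_val          : array of true responses on the validation set
    preds_by_model : dict mapping model name -> array of that model's
                     predictions on the (original) validation set
    Returns a dict mapping model name -> array of n_boot bootstrap MSEs.
    """
    rng = np.random.default_rng(seed)
    n = len(y_val)
    mse_dists = {name: np.empty(n_boot) for name in preds_by_model}
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one bootstrap validation set (rows drawn with replacement)
        for name, preds in preds_by_model.items():
            mse_dists[name][b] = np.mean((y_val[idx] - preds[idx]) ** 2)
    return mse_dists
```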
In a scenario where the MSE/MSPE distributions of two competing models overlap substantially, I would choose the more parsimonious model. It would basically be a test of $H_0: \text{the predictive capability of the parsimonious model equals that of the complex model}$.
If the mean of $\text{MSE}_1$ lies well within the 95th percentile of the distribution of $\text{MSE}_2$, we choose the more parsimonious of the two models, regardless of which one has the lower MSE.
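To spell out that rule, one possible reading as code (again only a sketch; `choose_model` and the fixed 0.95 cut-off are my own placeholders, reusing the bootstrap distributions computed above):

```python
import numpy as np

def choose_model(mse_parsimonious, mse_complex, q=0.95):
    # If the parsimonious model's mean bootstrap MSE falls below the q-th
    # quantile of the complex model's bootstrap MSE distribution, the two
    # models are treated as indistinguishable and the parsimonious one wins.
    if np.mean(mse_parsimonious) <= np.quantile(mse_complex, q):
        return "parsimonious"
    return "complex"

# e.g., with fitted models m1 (parsimonious) and m2 (complex):
# dists = bootstrap_mse_distributions(y_val, {"m1": m1.predict(X_val),
#                                             "m2": m2.predict(X_val)})
# print(choose_model(dists["m1"], dists["m2"]))
```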
Question: Does this method make sense to anyone other than me? Also, is it described anywhere in the statistical literature?
EDIT: It might seem like a similar question is asked here