
I have been reading about cross-validation, which involves training, validation, and test sets. Why do we need three subsets?

I realized that if we could reduce the variance of the estimated model performance, we wouldn't need the test set.

And we can reduce that variance by merging the validation and test sets and estimating model performance on the combined data.

So I wonder: if model selection is not so competitive that there is a real chance of picking the "best" model by luck, would it be better to just merge the validation and test sets and use something like bootstrapping or CV to estimate the variance of the model performance?
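To make that concrete, here is a rough sketch of the kind of procedure I have in mind; the hold-out data (`X_holdout`, `y_holdout`) and the fitted candidate models are just placeholders, not anything from a real experiment:

```python
# Rough sketch: merge validation and test data into one hold-out set,
# then bootstrap it to estimate both the mean and the variance of each
# candidate model's performance.  X_holdout, y_holdout and the fitted
# models are placeholders.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def bootstrap_accuracy(model, X_holdout, y_holdout, n_boot=1000):
    """Bootstrap the hold-out accuracy of an already fitted model."""
    n = len(y_holdout)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        scores.append(accuracy_score(y_holdout[idx],
                                     model.predict(X_holdout[idx])))
    scores = np.asarray(scores)
    return scores.mean(), scores.var(ddof=1)

# Pick the model with the best mean accuracy, but keep the variance so
# the uncertainty of the estimate can be reported alongside it, e.g.:
# results = {name: bootstrap_accuracy(m, X_holdout, y_holdout)
#            for name, m in fitted_models.items()}
```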

And even when the models are competing very closely, merging the validation and test sets gives more data for selection, so we would be better at picking the best model.

So I think that, as long as we report the variance of the model performance, it would be better to merge the test and validation data, pick the best model on the merged set, and report the performance estimate together with its variance.

Is this conclusion right, or are there holes in it?

KH Kim

1 Answer


Your conclusion is right only if you do not need to know what accuracy your model will achieve on new data. You will generally select a better model by increasing the size of the validation set, so merging the test and validation sets to select the best model by CV can be tempting.

But when your boss asks you "So, what accuracy can I expect from that model of yours, given new data?", you shouldn't give him the score you got on the merged test-validation set, because by selecting the best model on that set you have, in a sense, overfit it.

So in the end you need to evaluate your model on brand-new data that was never used for training or selection, and that is the role of the test set.
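To illustrate (this is just a toy simulation, not part of any real analysis): score several candidates with identical true accuracy on a single selection set, pick the winner, and compare its score there with its score on fresh test data.

```python
# Toy simulation of the selection effect: several candidates with the
# same true accuracy are compared on one "selection" set.  The winner's
# score on that set is biased upward; its score on fresh data is not.
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_models, n_sel, n_test, n_runs = 0.80, 10, 200, 200, 2000

gap_sel, gap_test = [], []
for _ in range(n_runs):
    # accuracy of each candidate measured on the selection set
    sel_scores = rng.binomial(n_sel, true_acc, size=n_models) / n_sel
    best = sel_scores.argmax()
    # the chosen model evaluated again on brand-new test data
    test_score = rng.binomial(n_test, true_acc) / n_test
    gap_sel.append(sel_scores[best] - true_acc)
    gap_test.append(test_score - true_acc)

print("mean optimism on selection set:", np.mean(gap_sel))    # clearly > 0
print("mean optimism on fresh test set:", np.mean(gap_test))  # close to 0
```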

Jacquot
  • Good point! But that is exactly the part I don't understand. Say I have a very small test set; then I can't claim the estimated accuracy is reliable, because of its randomness. Now assume instead an infinite validation set; then we cannot overestimate the accuracy of the best model. So as the validation set grows, the test set becomes meaningless, but with a small validation set we need a large enough test set to estimate accuracy well. There is a trade-off going on. I think the bias comes from selection by chance. – KH Kim Apr 08 '16 at 16:25
  • If the candidate models have almost the same performance, then one of them may be selected by chance and its estimated performance is biased upward. But if there is almost zero chance that any other model would be selected, then there is no bias in the performance estimate from the validation set, so we don't need a test set. Do I make sense here? – KH Kim Apr 08 '16 at 16:27
  • So, is there any research on how much data we should put in the validation set and how much in the test set? It must depend on the data, but it does not look trivial... – KH Kim Apr 08 '16 at 16:53
  • There is no chance at play here. If you choose a model because it's better on the CV set, you chose it because it is better on the CV set, period. But the simple fact that you chose the model based on its accuracy on the CV set means you are overfitting the CV set. You will never be able to say that _there's almost zero chance that other models are selected_, because you don't have the extra data to prove it. – Jacquot Apr 10 '16 at 00:14
  • So you know you are overfitting the CV set, no matter what, when selecting the best model, and you need new data to know the actual accuracy of your model on data you did not overfit. – Jacquot Apr 10 '16 at 00:15
  • I don't think you understood me. The bias comes from randomness. Say the two models' scores come from N(0, 1^2) and N(0, 1^2); then the larger score is biased upward, because it is chosen from two essentially identical distributions. But say they come from N(10, 1^2) and N(-10, 1^2); then the first one will always be picked, and as long as the second one is never picked, the selected score is not biased. And we can estimate the population variance, though of course it is only an estimate (see the simulation sketch after these comments). – KH Kim Apr 10 '16 at 00:58
  • Anyhow, I've been looking around and found this: http://www.win-vector.com/blog/2015/10/a-simpler-explanation-of-differential-privacy/ – KH Kim Apr 10 '16 at 00:59
  • which I found via this post: http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set?rq=1 – KH Kim Apr 10 '16 at 01:00
  • But the performance on the test set is also an estimate, just an unbiased one, and an unbiased estimator with large variance can be useless :) – KH Kim Apr 10 '16 at 01:12
  • That helps me see why a 1:1 split between the validation and test sets is often recommended. If you pick a better model by allocating more data to the validation set, the performance estimate from the test set will be poor (large variance). If you want a more precise performance estimate by allocating more data to the test set, the chosen model may not be the best one, even though its performance is precisely estimated... – KH Kim Apr 10 '16 at 01:29
  • That's it. As with many problems in science, it's a trade-off, and you have to choose the best proportion depending on what your goals are. – Jacquot Apr 11 '16 at 12:08
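A quick simulation of the point discussed in the comments, with purely illustrative Gaussian scores standing in for the validation accuracies of two candidate models:

```python
# Sketch of the "competing vs. non-competing models" point: the upward
# bias of the selected score appears when the candidates' true
# performances are close, and vanishes when one candidate dominates.
# Purely illustrative; the Gaussians stand in for validation scores.
import numpy as np

rng = np.random.default_rng(0)

def selection_bias(means, sd=1.0, n_runs=100_000):
    """Average of (selected score - true mean of the selected model)."""
    scores = rng.normal(means, sd, size=(n_runs, len(means)))
    best = scores.argmax(axis=1)
    return (scores[np.arange(n_runs), best] - np.asarray(means)[best]).mean()

print(selection_bias([0.0, 0.0]))     # ~0.56: biased upward (close race)
print(selection_bias([10.0, -10.0]))  # ~0.00: no bias (clear winner)
```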