Suppose I have decided to evaluate the following model selection procedure (let's call it PROC(1)):
START PROCEDURE:
For alpha in [0, 1, 2, ..., 1000]: compute the K-fold cross-validation error of M(alpha), the model parameterized by the hyperparameter alpha.
Pick the alpha* for which M(alpha*) has the best cross-validation error. Fit M(alpha*) on the full data.
END PROCEDURE
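For concreteness, here is a minimal sketch of PROC(1) in Python/scikit-learn. The ridge model, the squared-error metric, and K=5 are my own illustrative assumptions, not part of the procedure above; substitute whatever family M(alpha) you actually use.

```python
# Minimal sketch of PROC(1): select alpha by K-fold CV, refit on all data.
# Ridge regression and squared error are assumptions made for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def proc_1(X, y, alphas=tuple(range(0, 1001)), k=5):
    """Select alpha* by K-fold CV, then refit M(alpha*) on the full data."""
    cv_errors = []
    for alpha in alphas:
        scores = cross_val_score(Ridge(alpha=alpha), X, y,
                                 scoring="neg_mean_squared_error", cv=k)
        cv_errors.append(-scores.mean())               # K-fold CV error of M(alpha)
    best = int(np.argmin(cv_errors))
    final_model = Ridge(alpha=alphas[best]).fit(X, y)  # fit M(alpha*) on the full data
    return final_model, alphas[best], cv_errors[best]
```

Note that the third return value, the winning CV error, is exactly the quantity discussed next.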
We know the cross-validation error of M(alpha*) is going to be optimistically biased, i.e. biased downwards relative to the true generalization error of PROC(1), because the same cross-validation scores used to choose alpha* are then reported as its error: taking the minimum over many noisy estimates gives a number that is too low on average.
Thus, we need to use nested cross-validation to get an honest estimate of the generalization error of PROC(1). Let us suppose we apply nested cross-validation to PROC(1) and determine that it is not a good model selection procedure. We could try another procedure, call it PROC(2) (perhaps expanding or narrowing the grid of hyperparameters, or enlarging or shrinking the family of models), and estimate its generalization error in the same way.
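Under the same assumed setup as above, nested cross-validation can be sketched by wrapping the alpha search in an outer loop; the synthetic dataset below is just a placeholder.

```python
# Sketch of nested CV for PROC(1): the outer loop scores the *procedure*.
# Each outer training fold runs the full alpha search; the held-out outer
# fold is never touched by that search. Data here is a synthetic placeholder.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

inner = GridSearchCV(Ridge(), {"alpha": list(range(0, 1001))},
                     scoring="neg_mean_squared_error", cv=5)             # inner CV: picks alpha*
outer_scores = cross_val_score(inner, X, y,
                               scoring="neg_mean_squared_error", cv=5)   # outer CV: scores PROC(1)
print("Nested-CV estimate of PROC(1)'s generalization error:", -outer_scores.mean())
```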
The problem is that if we select the best procedure from {PROC(1), PROC(2), ..., PROC(M)}, the nested cross-validation estimate for the selected procedure, PROC(J), will no longer be unbiased, because we have again chosen the minimum of several noisy estimates. It would seem we need yet another level of nesting ("nested nested" cross-validation), but that only recreates the problem one level up: which set of procedures do we compare there? A small sketch of where the new bias enters follows below.
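To make the regress concrete, here is a hypothetical sketch (the candidate grids, data, and procedure names are made up): choosing the best of several procedures by their nested-CV scores is itself a selection over noisy estimates, so the winner's score is again optimistic.

```python
# Hypothetical illustration: selecting among procedures by their nested-CV
# scores re-uses those scores for selection, so the winner's reported score
# is again an optimistic estimate of its generalization error.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

candidate_procs = {                                      # made-up hyperparameter grids
    "PROC(1)": {"alpha": list(range(0, 1001))},
    "PROC(2)": {"alpha": [10.0 ** p for p in range(-3, 4)]},
}
nested_error = {}
for name, grid in candidate_procs.items():
    inner = GridSearchCV(Ridge(), grid, scoring="neg_mean_squared_error", cv=5)
    scores = cross_val_score(inner, X, y, scoring="neg_mean_squared_error", cv=5)
    nested_error[name] = -scores.mean()

best_proc = min(nested_error, key=nested_error.get)      # this min() is where the new selection bias enters
```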
And it seems unrealistic to hope to get the best PROC on the "first try," so to speak. So what are the strategies for dealing with this issue?