
As far as I understand, when implementing a learning algorithm that integrates model selection/hyper-parameter tuning into itself, nested cross-validation is necessary to lower the bias in the performance estimate.

I would summarize the algorithm that computes that performance estimate as follows:

    performances = []
    for training, test in partition(data):
        model = find_best_model(data_to_choose_best_model=partition(training))
        performances.append(model.fit_and_measure_performance(training, test))
    return some_method_to_aggregate_for_ex_average(performances)
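
For concreteness, here is a minimal runnable sketch of that scheme using scikit-learn's GridSearchCV and cross_val_score; the dataset, the SVM and the parameter grid are placeholder choices for illustration only:

    # Sketch of nested cross-validation: GridSearchCV plays the role of
    # find_best_model, cross_val_score plays the role of the outer loop.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)          # placeholder data
    param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}

    inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # partition(training)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # partition(data)

    # Hyper-parameters are tuned on the inner folds only.
    find_best_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)

    # Each outer training split is used to tune and refit; each outer test
    # split is only used to measure performance.
    performances = cross_val_score(find_best_model, X, y, cv=outer_cv)
    print(performances.mean())                           # aggregate (average)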

As we don’t have an infinite amount of time, we are forced to restrict the set of models/parameters that find_best_model browses. Setting aside the fact that we don’t use models we don’t know about, I would enumerate two ways of selecting that subset of models/parameters:

  1. experience/gut feeling,
  2. exploration/plotting some curves to evaluate how an algorithm reacts to the data at hand.

My question is the following: is there a way to implement 2., for example in how the data is selected/explored, that would permit lowering the bias it creates?

Indeed, implementing 2. ourselves, i.e. outside the “find_best_model” step in the algorithm above, seems to be a “seemingly benign short cut” that may induce a non-negligible “magnitude of [...] bias” (borrowing expressions from the very instructive first answer to Use of nested cross-validation). In other words, it seems similar to tuning hyper-parameters without going through nested cross-validation.
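
As a rough illustration of the size of that short cut's effect, one can (under the same placeholder setup as above) compare the non-nested estimate, i.e. the best inner-CV score that also guided the hyper-parameter choice, with the nested outer-CV estimate; the gap between the two is the optimism introduced by letting the same data both tune and evaluate:

    # Sketch: non-nested vs. nested estimate (placeholder data and grid).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

    grid = GridSearchCV(SVC(), param_grid, cv=inner_cv)

    grid.fit(X, y)
    non_nested = grid.best_score_                       # same data chose and scored the parameters
    nested = cross_val_score(grid, X, y, cv=outer_cv)   # scored on data the tuning never saw

    print("non-nested:", non_nested, "nested:", nested.mean())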

  • The proposal seems too vague to really comment on. Generally, if you don't need a reliable performance estimate, then you don't need nested cross-validation (provided your models have approximately the same number of hyper-parameters to tune, of comparable sensitivity). If you do need an unbiased performance estimate, then you need some data that hasn't been used in *any* way, even indirectly via the operator, to make any choices about the model. – Dikran Marsupial Aug 19 '16 at 17:30
  • Let's assume I need an unbiased performance estimate to choose a model. In that case, are you suggesting that for a given use case, to decide whether, for example, random forest or SVM should be applied, either find_best_model makes that choice or another set of data is used to make it? If so, isn't it better to let 'find_best_model' make that choice? – yetanotherion Aug 19 '16 at 18:41
  • You don't need an unbiased performance estimate to choose a model (you can add an arbitrary constant to the estimates and you will still choose the same model). Ideally you want a low variance estimator, so that the bias introduced by using it to tune the hyper-parameters will be small and hopefully approximately the same for all models under consideration (i.e. the same, small number of hyper-parameters and similar sensitivity to the hyper-parameter settings). – Dikran Marsupial Aug 20 '16 at 14:18
  • Thanks for the answer. What I still don't get, though, is why on one hand we can expect that choosing SVM over random forest may generate a negligible bias, while on the other hand tuning the parameters of one of the models will not. Can't we have a low variance estimator to choose hyper-parameters? Moreover, if we add the choice of the number of features to the issue: how can we know (besides experience/asking experts) whether we can have a low variance estimator to choose the number of features? – yetanotherion Aug 21 '16 at 17:47
  • optimising the hyper-parameters of a model via cross-validation will always mean the cross-validation estimate becomes biased (as part of the reduction in CV error will be due to the random sampling, rather than true improvement in generalisation). However, the SVM tends to be more sensitive to its hyper-parameter settings than the random forest, so the bias will tend to be greater for the SVM than RF. We do want a low variance estimator, and can get one (e.g. bootstrapping) but it is usually expensive. Feature selection usually makes these problems much worse. – Dikran Marsupial Aug 22 '16 at 09:08
  • I have answered other questions on feature selection; if you use a regularised model, e.g. SVM, then feature selection more often than not makes generalisation performance worse, not better, provided the regularisation parameter is tuned properly. – Dikran Marsupial Aug 22 '16 at 09:09
  • Thanks again for this answer. What would you say about the following summary: to choose one model over another, a low variance estimator is mandatory. However, if the amount of data used to estimate the performance is small compared to the size of the "parameter space" we aim to browse, then we're hopeless, as the number of samples will be too low to handle the variance. In other words, the chances of getting unlucky and estimating a value quite far from the real one (i.e. not choosing the best model) are high? – yetanotherion Aug 22 '16 at 12:14
  • From what is suggested in http://jmlr.org/papers/v11/cawley10a.html (which you wrote, if I understood your other replies on Stack Exchange correctly :)), in the case of very small datasets there are three options: a fully Bayesian approach (I haven't read the article yet), random forest, or getting more data. Regarding the random forest option, is it suitable because there is a single hyper-parameter to set (the number of trees in the forest, if I understood correctly), or more precisely because setting that hyper-parameter is not sensitive to the variance induced by an estimate made with very little data? – yetanotherion Aug 22 '16 at 12:23
