
I am currently trying to build an ML pipeline for fMRI data. To get an unbiased estimate of the pipeline's performance, I use a nested cross-validation structure. However, I am not sure in what order hyperparameter optimization and feature selection should be performed within the nested CV structure. I have four options (but I am always open to better ones):

  1. 3-loop nested cross-validation.
Outer loop: Model evaluation
Middle loop: Feature selection
Inner loop: Hyperparameter optimization
  2. 3-loop nested cross-validation.
Outer loop: Model evaluation
Middle loop: Hyperparameter optimization
Inner loop: Feature selection
  3. 2-loop nested cross-validation.
Outer loop: Model evaluation
Middle loop: First feature selection, then hyperparameter optimization
  4. 2-loop nested cross-validation.
Outer loop: Model evaluation
Middle loop: First hyperparameter optimization, then feature selection

Please keep in mind that fMRI studies generally have small samples (around 20-30 per category) unless you are working on a connectomics project.

afarensis
  • Why does feature selection have to be done before or after hyperparameter selection? Wouldn't it be most correct if we just consider our feature set as a hyperparameter and do it all at once? – astel Aug 12 '19 at 20:11
  • I would definitely think about that. Could you describe that more? – afarensis Aug 12 '19 at 20:27
  • What I am suggesting is that choosing which features to include in your model is the same thing as choosing which values to use for your hyperparameters. The choice of using features a, b, and c vs. using a, b, and d is akin to choosing, say, the value of lambda in a lasso regression. – astel Aug 12 '19 at 20:43
  • Thank you for your suggestion. What I get from your comment is to use an embedded method instead of performing feature selection and hyperparameter optimization for non-regularized ML models. If so, I cannot use an embedded method such as elastic net, since I also use an in-house feature selection pipeline combining network-based statistics and a wrapper model. – afarensis Aug 12 '19 at 21:19
  • That's not at all what I am suggesting. What I am suggesting is that your variables ARE hyperparameters. Conceptually and practically there is no difference between selecting which features to use and which hyperparameter values to pick. – astel Aug 12 '19 at 21:29
  • Ah okay, I think I get that now. So, what you suggest is to run hyperparameter optimization (random search, Bayesian optimization, etc.) over feature selection and classifier hyperparameters at once. – afarensis Aug 12 '19 at 22:28
  • Does this answer your question? [How should Feature Selection and Hyperparameter optimization be ordered in the machine learning pipeline?](https://stats.stackexchange.com/questions/264533/how-should-feature-selection-and-hyperparameter-optimization-be-ordered-in-the-m) – skeller88 Apr 16 '20 at 18:09

1 Answer


@astel's advice is spot on: which features to use is part of your model's hyperparameters, thus

Outer loop: Model evaluation
Inner loop: Hyperparameter optimization, including feature selection

is the way to go.
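
For illustration, here is a minimal scikit-learn sketch of that structure; the scaler, the univariate feature scorer, the SVC, and the parameter ranges are placeholders for the example, not the in-house pipeline from the question:

```python
# Sketch: nested CV where feature selection is tuned as just another
# hyperparameter of the inner search. The estimator, feature scorer, and
# parameter ranges below are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for a small fMRI sample: 60 subjects, 500 candidate features.
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", SVC(kernel="rbf")),
])

# Feature selection ("select__k") and the classifier's hyperparameters
# ("clf__C", "clf__gamma") live in the same grid, so they are optimized jointly.
param_grid = {
    "select__k": [10, 20, 50],
    "clf__C": [0.1, 1, 10],
    "clf__gamma": ["scale", 1e-3],
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter optimization, including feature selection.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")

# Outer loop: model evaluation only (the search is refit inside each outer fold).
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(scores.mean(), scores.std())
```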

The important reason is that feature selection and the optimization of the other hyperparameters are usually not independent, and should therefore be optimized together:

  • for different sets of features, the optimal other hyperparameters may be different, and
  • for different sets of other hyperparameters, the optimal features may be different.

In such a situation, optimizing one after the other may miss the global optimum (this is the introductory example in many design-of-experiments courses on why you should not optimize one factor at a time unless you have external knowledge that the factors' influences on the system under study are independent).
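
As a toy illustration of that interaction (the scores below are made up purely for this example): when the best regularization setting depends on the feature set, tuning one factor while holding the other fixed lands on a worse combination than searching the grid jointly.

```python
# Toy illustration with made-up inner-CV accuracies: one-factor-at-a-time
# tuning can miss the jointly optimal combination when factors interact.
scores = {
    ("features_A", "C=0.1"): 0.70,
    ("features_A", "C=10"):  0.68,
    ("features_B", "C=0.1"): 0.72,
    ("features_B", "C=10"):  0.85,   # joint optimum
}

# One-factor-at-a-time, starting from ("features_A", "C=0.1"):
# step 1: tune C with features_A fixed -> keeps C=0.1 (0.70 > 0.68)
best_c = max(["C=0.1", "C=10"], key=lambda c: scores[("features_A", c)])
# step 2: tune the feature set with C fixed -> features_B at 0.72
best_f = max(["features_A", "features_B"], key=lambda f: scores[(f, best_c)])
print("sequential:", (best_f, best_c), scores[(best_f, best_c)])  # 0.72

# A joint search over the full grid finds the true optimum, 0.85.
best_joint = max(scores, key=scores.get)
print("joint:", best_joint, scores[best_joint])
```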

cbeleites unhappy with SX
  • Thanks for the answer. Is it possible to use an evolutionary algorithm (say, a genetic algorithm) to search the features and the other hyperparameters simultaneously in the inner loop? – uared1776 Sep 02 '19 at 15:27
  • @uared1776: yes you can do this. But those algorithms tend to compare tens of thousands of models. Unless you have a truly *huge* sample size and guard against overfitting, you'll run a high risk of substantial overfitting being caused by the combination of such massively multiple comparisons together with random uncertainty on the observed performance. – cbeleites unhappy with SX Sep 02 '19 at 16:03
  • I am thinking of a multiple-objective optimization, such that obj. #1 is the prediction error in the inner loop, and obj. #2 is a function related to model complexity. Thus, the trade-off between model complexity and CV accuracy in inner loop can be derived (in the form of Pareto front). Less than 1000 combinations of the hyperparameters will be evaluated in each inner loop, given a total of 10k samples. The overfitting risk IMHO is not high, and it is possible to compare the final generation models using test data in the outer loop (this info is for understanding but not for model selection). – uared1776 Sep 02 '19 at 16:40