I am currently working on a project where I need to train a logistic regression classifier with a combined $l_1$/$l_2$-penalty that satisfies a hard cap on the number of features. Specifically, my dataset contains over 100 features, and I need to fit a model that uses at most 5 features using the glmnet package in R.
I am wondering if I can design the model training / selection process so that I can: a) fit a single model that will satisfy the constraint on feature size; b) obtain an unbiased assessment of predictive accuracy.
Right now, the best setup that I have come up with is as follows:
1) Save 20% of the data for testing.
2) Training / Validation: Use the remaining 80% of the data as follows. Say there are $M$ unique combinations of free parameters. For each unique combination of free parameters:
a) Run a $K$-fold CV: that is, train $K$ models using subsets of the training/validation data to estimate predictive accuracy for this combination of free parameters.
b) Train a final model using all of the training/validation data.
In this way, we end up with $M$ models, each with a $K$-CV estimate of predictive accuracy.
3) Model Selection and Evaluation: From the $M$ models trained on all training/validation data, I pick one that a) satisfies the model size constraint and b) optimizes some $K$-CV metric of interest. I then evaluate its predictive accuracy on the test set that I put aside at the beginning.
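For concreteness, here is a rough sketch of the procedure in R using glmnet / cv.glmnet. The data are simulated, and a small grid of alpha values stands in for whatever free parameters are actually being searched; both are assumptions for illustration:

```r
## Rough sketch of steps 1-3 (simulated data; the alpha grid is a
## hypothetical stand-in for the free-parameter combinations).
library(glmnet)

set.seed(1)
x <- matrix(rnorm(500 * 100), 500, 100)
y <- rbinom(500, 1, plogis(x[, 1] - x[, 2] + x[, 3]))

## 1) Hold out 20% for testing
test_idx <- sample(nrow(x), size = floor(0.2 * nrow(x)))
x_tv <- x[-test_idx, ]; y_tv <- y[-test_idx]
x_te <- x[test_idx, ];  y_te <- y[test_idx]

## 2) For each free-parameter setting, run K-fold CV over the lambda
##    path; cv.glmnet also refits on all training/validation data.
alphas <- c(0.5, 0.8, 1.0)
fits <- lapply(alphas, function(a)
  cv.glmnet(x_tv, y_tv, family = "binomial", alpha = a,
            type.measure = "deviance", nfolds = 10))

## 3) Among (alpha, lambda) pairs whose refit model has <= 5 nonzero
##    coefficients, pick the one with the best CV deviance ...
best <- NULL
for (fit in fits) {
  ok <- which(fit$nzero <= 5)
  i  <- ok[which.min(fit$cvm[ok])]
  if (is.null(best) || fit$cvm[i] < best$cvm)
    best <- list(fit = fit, lambda = fit$lambda[i],
                 cvm = fit$cvm[i], nzero = fit$nzero[i])
}

## ... and evaluate it once on the held-out test set
pred <- predict(best$fit, newx = x_te, s = best$lambda, type = "class")
acc  <- mean(pred == y_te)
```

Note that the lasso-style path always starts at the empty model, so the set of lambdas with at most 5 nonzero coefficients is never empty.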
This setup works well in that it satisfies the model size constraint and produces a single model, but it only yields a point estimate of test accuracy (so we get no idea of its variance). It also produces a $K$-CV estimate of predictive accuracy from the validation procedure, though this is clearly biased since the model was picked using that statistic.
Note: I should add that I am mainly interested in knowing if I can design the training process to satisfy such a constraint. That is, I'm not interested in using another method for feature selection, or in using the hard cap on the # of features in glmnet's settings (glmnet can produce models with different levels of sparsity by tuning the lambda parameter, and that is all I would like to do).