I am currently working on a project where I need to train a logistic regression classifier with a combined $l_1$/$l_2$-penalty that satisfies a hard cap on the number of features. Specifically, my dataset contains over 100 features, and I need to fit a model that uses at most 5 features using the glmnet package in R.
I am wondering if I can design the model training / selection process so that I can: a) fit a single model that will satisfy the constraint on feature size; b) obtain an unbiased assessment of predictive accuracy.
Right now, the best setup that I have come up with is as follows:
1) Save 20% of the data for testing.
2) Training / Validation: Use the remaining 80% of the data as follows. Say there are $M$ unique combinations of free parameters. For each unique combination of free parameters:
a) Run a $K$-fold CV: that is, train $K$ models using subsets of the training/validation data to estimate predictive accuracy for this combination of free parameters.
b) Train a final model using all of the training/validation data.
In this way, we end up with $M$ models, each with a $K$-CV estimate of predictive accuracy.
3) Model Selection and Evaluation: From the $M$ models trained on all training/validation data, I pick one that a) satisfies the model size constraint and b) optimizes some $K$-CV metric of interest. I then evaluate its predictive accuracy on the test set that I put aside at the beginning.
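For concreteness, here is a rough sketch of the procedure in R using glmnet / cv.glmnet. The data are simulated, and a small grid of alpha values stands in for whatever free parameters are actually being searched; both are assumptions for illustration:

```r
## Rough sketch of steps 1-3 (simulated data; the alpha grid is a
## hypothetical stand-in for the free-parameter combinations).
library(glmnet)

set.seed(1)
x <- matrix(rnorm(500 * 100), 500, 100)
y <- rbinom(500, 1, plogis(x[, 1] - x[, 2] + x[, 3]))

## 1) Hold out 20% for testing
test_idx <- sample(nrow(x), size = floor(0.2 * nrow(x)))
x_tv <- x[-test_idx, ]; y_tv <- y[-test_idx]
x_te <- x[test_idx, ];  y_te <- y[test_idx]

## 2) For each free-parameter setting, run K-fold CV over the lambda
##    path; cv.glmnet also refits on all training/validation data.
alphas <- c(0.5, 0.8, 1.0)
fits <- lapply(alphas, function(a)
  cv.glmnet(x_tv, y_tv, family = "binomial", alpha = a,
            type.measure = "deviance", nfolds = 10))

## 3) Among (alpha, lambda) pairs whose refit model has <= 5 nonzero
##    coefficients, pick the one with the best CV deviance ...
best <- NULL
for (fit in fits) {
  ok <- which(fit$nzero <= 5)
  i  <- ok[which.min(fit$cvm[ok])]
  if (is.null(best) || fit$cvm[i] < best$cvm)
    best <- list(fit = fit, lambda = fit$lambda[i],
                 cvm = fit$cvm[i], nzero = fit$nzero[i])
}

## ... and evaluate it once on the held-out test set
pred <- predict(best$fit, newx = x_te, s = best$lambda, type = "class")
acc  <- mean(pred == y_te)
```

Note that the lasso-style path always starts at the empty model, so the set of lambdas with at most 5 nonzero coefficients is never empty.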
This setup works well in that it satisfies the model size constraint and produces a single model, but it only yields a point estimate of test accuracy (so we get no idea of its variance). It also produces a $K$-CV estimate of predictive accuracy from the validation procedure, though this is clearly biased since the model was picked using that statistic.
Note: I should add that I am mainly interested in knowing if I can design the training process to satisfy such a constraint. That is, I'm not interested in using another method for feature selection, or in using the hard cap on the # of features in glmnet's settings (glmnet can produce models with different levels of sparsity by tuning the lambda parameter, and that is all I would like to do).