I have a small dataset on which I'm training LASSO and random forest models. I use nested cross-validation to tune hyperparameters and get (approximately) unbiased performance estimates. The total number of candidate features exceeds the number of observations (p > n). Roughly, the setup looks like the sketch below.
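For concreteness, here is a minimal sketch of what I mean, using scikit-learn; `X`, `y`, the parameter grids, and the fold counts are placeholders, not my actual data or settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# placeholder data with p > n; in reality X, y come from my dataset
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 500))
y = rng.integers(0, 2, 60)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# "LASSO": L1-penalised logistic regression, with C tuned in the inner loop
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=5000)),
])
lasso_search = GridSearchCV(lasso, {"clf__C": np.logspace(-3, 1, 9)},
                            scoring="roc_auc", cv=inner_cv)
lasso_auc = cross_val_score(lasso_search, X, y, scoring="roc_auc", cv=outer_cv)

# random forest, tuned the same way
rf_search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    {"max_features": ["sqrt", 0.1, 0.3], "min_samples_leaf": [1, 3, 5]},
    scoring="roc_auc", cv=inner_cv,
)
rf_auc = cross_val_score(rf_search, X, y, scoring="roc_auc", cv=outer_cv)

print("LASSO outer AUC:", lasso_auc.mean(), " RF outer AUC:", rf_auc.mean())
```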
The resulting model performance (outer-loop nested-CV AUC) drops the more candidate features I allow into the pool. What is the likely reason for this? My guess is that, with more candidates to choose from, each outer fold ends up selecting an increasingly different subset of features, and these unstable selections don't generalize well. LASSO and forward selection seem much more sensitive to this than the random forest.
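One way to make that instability concrete (continuing the sketch above, so it reuses `X`, `y`, `outer_cv`, and `lasso_search`) would be to refit the tuned LASSO on each outer-training split and compare which coefficients come out non-zero:

```python
# continues the sketch above: reuses X, y, outer_cv and lasso_search
selected = []
for train_idx, _ in outer_cv.split(X, y):
    fit = lasso_search.fit(X[train_idx], y[train_idx])
    coefs = fit.best_estimator_.named_steps["clf"].coef_.ravel()
    selected.append(set(np.flatnonzero(coefs)))

print("features kept per outer fold:", [len(s) for s in selected])
print("kept in every fold:", len(set.intersection(*selected)))
```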
My models do great if I first restrict to the features that are univariately significant by Pearson correlation with the outcome, but I know I'm really not supposed to do that filtering outside of the validation (sketched below)... Any suggestions?
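To be explicit about the filtering step I mean (again reusing `X` and `y` from the sketch above): it is computed once on the full dataset, before any CV split, which is exactly the part I suspect is leaking:

```python
from scipy.stats import pearsonr

# univariate p-value of each feature against the outcome, on the FULL dataset
pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(X.shape[1])])
X_filtered = X[:, pvals < 0.05]   # keep only the "significant" features
# ...and then run the same nested CV as above on X_filtered
```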