Questions about principal component analysis and recursive feature elimination

Question

A colleague of mine suggested using PCA before RFE to reduce the dimensionality (102 features versus 37 samples) and to get rid of the correlation problem: if I use RFE with a support vector regression (SVR), the sparse solution can arbitrarily choose one feature out of two highly correlated features. Would you agree with that logic?

The PCA returned 36 features that account for 100% of the variance. I therefore subjected those 36 features to an RFE with kfold = 5 and hyperparameter optimization using GridSearchCV. However, the RFE performs very badly. Scoring is the r2 score, and I got an r2 score of -0.24. Looking at the folds, I see some very high r2 scores on the training sets (0.7 - 0.9) but bad r2 scores on the test sets. Does that mean that my model is overfitted? Do you see other reasons?

Many thanks for your help, mike
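The train/test gap described in the question can be reproduced on data that contains no signal at all. The following is only a minimal sketch, assuming scikit-learn, synthetic Gaussian data with the same shape as in the question (37 samples, 102 features), and arbitrary illustrative settings (C=10.0, 10 retained components); none of these values come from the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real data: the target is pure noise,
# so any apparent fit on the training folds is overfitting by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(37, 102))
y = rng.normal(size=37)

# Scaling, PCA and RFE all live inside the pipeline, so they are refit
# on the training part of every fold and never see the test part.
pipe = make_pipeline(
    StandardScaler(),
    PCA(),
    RFE(SVR(kernel="linear", C=10.0), n_features_to_select=10),  # assumed settings
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv, scoring="r2", return_train_score=True)

train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
print(f"mean train r2: {train_r2:.2f}")
print(f"mean test r2:  {test_r2:.2f}")
```

A training r2 well above the test r2 on noise like this would point to the selection procedure itself fitting chance patterns, which is one possible reading of the scores reported in the question.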

Your PCA will always return at most 37 features for 100% of the variance because you're in a high-dimensional setting, so at most 37 columns can be linearly independent. With so few observations, I think it's extremely dangerous to do such a high-variance feature selection process, and you're almost inevitably overfitting. Do you really need to do feature selection? Or could you maybe fit a penalized linear model? — jld, May 07 '18 at 19:29
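The rank argument in this comment can be checked numerically. The sketch below assumes scikit-learn and random data of the same shape as in the question. Because scikit-learn's PCA centers the data before decomposing it, one further degree of freedom is lost, so at most n - 1 = 36 components carry variance, which is consistent with the 36 features reported in the question.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data: 37 samples, 102 features (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(37, 102))

pca = PCA().fit(X)

# Components whose variance is numerically distinguishable from zero.
n_nonzero = int(np.sum(pca.explained_variance_ > 1e-10))

# Rank of the centered data matrix, which bounds the number of such components.
rank_centered = np.linalg.matrix_rank(X - X.mean(axis=0))

print("components returned:", pca.n_components_)
print("components with non-zero variance:", n_nonzero)
print("rank of centered data:", rank_centered)
```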
Thank you, Chaconne, for your useful comments. I see the challenge. So in your opinion, PCA is a bad decision here? Do you mean I should run e.g. an SVR with all 102 features and then tune the regularization parameter C using cross-validation? Many thanks for your efforts, mike — mike, May 08 '18 at 19:15
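For concreteness, the setup asked about here (an SVR on all 102 features with C chosen by cross-validation) might look roughly like the sketch below. It assumes scikit-learn, a linear kernel, synthetic data with a made-up signal in the first two features, and an arbitrary grid of C values. Whether this approach is advisable with 37 samples is not settled by it.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the signal in columns 0 and 1 is hypothetical.
rng = np.random.default_rng(1)
X = rng.normal(size=(37, 102))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=37)

# No feature selection: all 102 features go in, and C controls the penalty.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="linear")),
])

param_grid = {"svr__C": [0.001, 0.01, 0.1, 1.0, 10.0]}  # assumed grid
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_C = search.best_params_["svr__C"]
print("selected C:", best_C)
print("cross-validated r2 at selected C:", round(search.best_score_, 2))
```

Note that `best_score_` was used to pick C, so it is an optimistic estimate of how the tuned model performs on new data.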
I'd definitely give the answer here a good read: https://stats.stackexchange.com/questions/113994/ (lots of good advice and links to other sources on model selection with small samples). This is just a challenging problem, and I'd keep it absolutely as simple as you possibly can. I would second the main answerer's advice there: any sort of data-dependent model selection (even tuning a single hyperparameter) is a tricky business with so few data points, and properly validating your model is really hard because you don't have the data for a hold-out set. — jld, May 09 '18 at 13:57
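One common way to account for hyperparameter tuning when no hold-out set is available is nested cross-validation, where the tuning runs only on the inner folds and the outer folds are used purely for scoring. This is not something the comment prescribes. The sketch below is an illustration under assumptions: scikit-learn, a ridge regression as the penalized linear model, synthetic data with a made-up signal, and an arbitrary alpha grid.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the signal in columns 0 and 1 is hypothetical.
rng = np.random.default_rng(2)
X = rng.normal(size=(37, 102))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=37)

model = make_pipeline(StandardScaler(), Ridge())

# Inner loop: chooses alpha using only the outer-training samples.
inner = KFold(n_splits=5, shuffle=True, random_state=0)
tuned = GridSearchCV(
    model,
    {"ridge__alpha": np.logspace(-2, 3, 6)},  # assumed grid
    scoring="r2",
    cv=inner,
)

# Outer loop: scores the whole tuning procedure on samples it never saw.
outer = KFold(n_splits=5, shuffle=True, random_state=1)
outer_r2 = cross_val_score(tuned, X, y, scoring="r2", cv=outer)

print("outer-fold r2 scores:", np.round(outer_r2, 2))
print("mean outer r2:", round(outer_r2.mean(), 2))
```

With only about 7 samples per outer test fold, the individual scores will vary a lot between folds, which illustrates the point about validation being hard at this sample size.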
