
I have a regression problem where the number of samples n is smaller than the number of features p (e.g., p = 500 and n = 400; the problem may extend to p = 3000+ with n = 400). The features are continuous, largely normally distributed, and contain a few outliers. All variables are informative to some degree, although there is (imperfect) multicollinearity. About 2/5 of the features appear to be sufficient to estimate the target with high accuracy (in terms of MSE).

I would like to select (not extract, so no autoencoder, PCA, etc.) the features with the highest explanatory power. Irrelevant variables (e.g., those with zero variance) and redundant ones (i.e., those that correlate strongly with each other) should be removed.
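To illustrate this pre-filtering, here is a minimal sketch, assuming X is a pandas DataFrame; the |r| > 0.95 cutoff for "strongly correlated" is an arbitrary choice of mine:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    # X: pandas DataFrame of shape (n_samples, n_features)
    # Drop zero-variance (irrelevant) features
    vt = VarianceThreshold(threshold=0.0)
    X_var = X.loc[:, vt.fit(X).get_support()]

    # Drop one feature from each strongly correlated pair (0.95 cutoff is arbitrary)
    corr = X_var.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
    X_filtered = X_var.drop(columns=to_drop)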

To achieve this, I have thought of the following methodology (sketched in code below):

  1. Split the data set into training and test sets.
  2. Use sklearn's K-Fold method to prepare the training set for cross-validation.
  3. Train a model such as ridge regression, including hyperparameter optimization (determination of lambda) via cross-validation.
  4. Use this model (only the tuned lambda, not the fitted model) within RFECV (recursive feature elimination with cross-validation) to determine the number of "necessary" features (about 2/5 of them).
  5. Use this model within RFE (without cross-validation, on the whole training set) to determine the 2/5 most relevant features.
  6. Use these most relevant features in the final model to predict y.
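For concreteness, here is a minimal sklearn sketch of steps 1 to 6; the 80/20 split, the alpha grid, and the random seeds are placeholder choices of mine, not fixed parts of the problem:

    import numpy as np
    from sklearn.model_selection import train_test_split, KFold, GridSearchCV
    from sklearn.linear_model import Ridge
    from sklearn.feature_selection import RFECV, RFE

    # X, y: numpy arrays of shape (n, p) and (n,)
    # 1. Hold out a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # 2. Cross-validation splitter for the training data
    cv = KFold(n_splits=5, shuffle=True, random_state=0)

    # 3. Tune lambda (alpha in sklearn) for ridge regression
    grid = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)},
                        cv=cv, scoring="neg_mean_squared_error")
    grid.fit(X_train, y_train)
    best_alpha = grid.best_params_["alpha"]

    # 4. RFECV with the tuned alpha to estimate how many features are "necessary"
    rfecv = RFECV(Ridge(alpha=best_alpha), step=1, cv=cv,
                  scoring="neg_mean_squared_error")
    rfecv.fit(X_train, y_train)
    n_features = rfecv.n_features_

    # 5. RFE on the whole training set to pick that many features
    rfe = RFE(Ridge(alpha=best_alpha), n_features_to_select=n_features, step=1)
    rfe.fit(X_train, y_train)
    selected = rfe.support_

    # 6. Fit the final model on the selected features and evaluate on the test set
    final = Ridge(alpha=best_alpha).fit(X_train[:, selected], y_train)
    test_mse = np.mean((final.predict(X_test[:, selected]) - y_test) ** 2)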

My questions are as follows:

  1. Is there anything wrong with this approach? I know there is data leakage (and hence a risk of overfitting) in the model building of step 6 if the feature selection [steps 3 to 5] is done on the same data set. This could be circumvented by performing the RFE within each fold during model building (see the sketch after this list).
  2. Is there a reason why ridge regression is rarely used in conjunction with RFE, especially when p >> n? In contrast, RFE is very often combined with SVMs (support vector machines or regressors), especially in medical applications.
  3. RFE is a variant of backward selection in which, instead of a score (such as AIC, a CV score, or a t-test), the feature importance / size of the coefficients is used to remove the least relevant feature at each step. It is said that backward selection is only applicable when n >= p. Is this not a problem of the procedure itself, but of the underlying prediction model? In ordinary multiple linear regression there is no unique solution when p >> n, so the coefficients vary strongly and a selection based on them is not meaningful. Should this problem disappear when using ridge regression, and especially when using SVR?
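Regarding question 1, this is roughly what I mean by performing the RFE within each fold: nesting the selector in a Pipeline so it is refit on the training portion of every fold, which keeps test-fold information out of the feature selection (the alpha value and the number of features to select are placeholders):

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import Ridge

    # The selector is refit inside every fold, so no leakage into model evaluation
    pipe = Pipeline([
        ("select", RFE(Ridge(alpha=1.0), n_features_to_select=200, step=10)),
        ("model", Ridge(alpha=1.0)),
    ])

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(pipe, X_train, y_train, cv=cv,
                             scoring="neg_mean_squared_error")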
Comment by kjetil b halvorsen (Dec 16 '21 at 21:32): There are many similar posts, did you search this site? See https://stats.stackexchange.com/questions/328630/is-ridge-regression-useless-in-high-dimensions-n-ll-p-how-can-ols-fail-to and https://stats.stackexchange.com/questions/218208/what-are-the-advantages-of-stepwise-regression
