How to choose the best variables?
I'm training regression models with a single tree and with ensemble methods (bagging and random forest) to make a prediction. In the exploration phase I computed Pearson correlations between the response variable and the independent variables.
For example:
- var1 0.97;
- var2 0.97;
- var3 0.95;
- var4 0.95;
- var5 0.76;
- var6 0.72;
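
Roughly, the correlations were computed like this (a minimal sketch with pandas; the file name `data.csv` and the response column name `target` are placeholders):

```python
import pandas as pd

# Hypothetical file and column names: predictors var1..var6 plus a
# response column called "target".
df = pd.read_csv("data.csv")

# Pearson correlation of each independent variable with the response,
# sorted from strongest to weakest.
corr_with_target = df.corr(method="pearson")["target"].drop("target")
print(corr_with_target.sort_values(ascending=False))
```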
In some cases there is very high correlation between the independent variables themselves, but since I'm using tree-based models (including the ensemble methods) I decided not to remove any of them.
But every time I train a model I obtain a very high explained variance. For example, a random forest trained only on the first variable (var1) gives an explained variance of 97.11%.
So my question is: should I discard the first 4 variables? I think they influence the model too much. Is this an overfitting problem?
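
A sketch of that single-variable setup (scikit-learn assumed; I use cross-validation here rather than a single training score, and the file and column names are the same placeholders as above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")        # same hypothetical file as above
X_var1 = df[["var1"]]               # only the strongest predictor
y = df["target"]

rf = RandomForestRegressor(n_estimators=500, random_state=0)

# Cross-validated explained variance, to see whether the ~97 % figure
# also holds on data the model has not seen during training.
scores = cross_val_score(rf, X_var1, y, cv=5, scoring="explained_variance")
print(f"explained variance: {scores.mean():.4f} +/- {scores.std():.4f}")
```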
How to compare different models?
I'm comparing the single tree with the ensemble methods using RMSE (Root Mean Squared Error). Is this good practice?
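
For concreteness, a sketch of the comparison I have in mind (scikit-learn assumed, with cross-validated RMSE; the model settings and column names are placeholders):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")        # same hypothetical file as above
X = df[["var1", "var2", "var3", "var4", "var5", "var6"]]
y = df["target"]

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(n_estimators=200, random_state=0),  # bags decision trees by default
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    # scikit-learn reports RMSE as a negated score, so flip the sign.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```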