
How to choose the best variables?

I'm training a regression model with a single tree and with ensemble methods (bagging and random forest) to make a prediction. In the exploration phase I computed Pearson correlations between the response variable and the independent variables.

For example:

  • var1: 0.97
  • var2: 0.97
  • var3: 0.95
  • var4: 0.95
  • var5: 0.76
  • var6: 0.72
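As a minimal sketch of this kind of exploration (the data here is synthetic, with noise levels chosen as an assumption to mimic the correlation pattern above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic response; var1/var2 are near-duplicates of y, var5 is noisier,
# roughly reproducing the correlation pattern reported in the question.
y = rng.normal(size=n)
var1 = y + 0.2 * rng.normal(size=n)
var2 = y + 0.2 * rng.normal(size=n)
var5 = y + 0.9 * rng.normal(size=n)

# Pearson correlation of each candidate predictor with the response.
corrs = {name: np.corrcoef(x, y)[0, 1]
         for name, x in [("var1", var1), ("var2", var2), ("var5", var5)]}
for name, r in corrs.items():
    print(f"{name}: r = {r:.2f}")
```

With these noise levels, var1 and var2 come out around 0.97-0.98 and var5 around 0.75, similar to the numbers above.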

In some cases there is a very high correlation between the independent variables, but since I'm using tree-based models together with ensemble methods, I decided not to remove any variables.

But when I train my model I obtain a very high explained variance every time. For example, a random forest trained only on the first variable (var1) gives an explained variance of 97.11%.
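One way to check whether such a high score reflects overfitting is to compare explained variance on the training set against a held-out test set. A sketch of that diagnostic, using synthetic data and scikit-learn (the variable name and noise level are assumptions, chosen so var1 correlates with the response at roughly 0.98):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
# var1 is strongly correlated with y, as in the question.
X1 = (y + 0.2 * rng.normal(size=n)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X1, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# If the train score is high but the test score collapses, that suggests
# overfitting; a high score on BOTH sets just means var1 is very predictive.
ev_train = explained_variance_score(y_tr, rf.predict(X_tr))
ev_test = explained_variance_score(y_te, rf.predict(X_te))
print(f"train: {ev_train:.2%}, test: {ev_test:.2%}")
```

When a single predictor correlates with the response at ~0.97, an explained variance near 97% is expected even on held-out data, so a high score by itself is not evidence of overfitting.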

So my question is: should I discard the first four variables? I think those four variables dominate my model. Is this an overfitting problem?

How to compare different models?

I'm comparing the single tree with the ensemble methods using RMSE (Root Mean Squared Error). Is this good practice?
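For reference, here is one common way to run such a comparison: fit each model on the same training split and compute RMSE on the same held-out test set. The data is synthetic and the model settings are scikit-learn defaults, so this is only a sketch of the procedure, not a reproduction of the question's results:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 6))
# Linear signal plus noise over six hypothetical predictors.
y = X @ np.array([3.0, 2.5, 2.0, 1.5, 0.5, 0.3]) + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
}
rmses = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    # RMSE = sqrt of the mean squared error on the held-out set.
    rmses[name] = mean_squared_error(y_te, m.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmses[name]:.2f}")
```

Comparing all models with the same metric on the same held-out data (or better, via cross-validation) keeps the comparison fair; on data like this the ensembles typically show a lower RMSE than the single tree.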

    I've voted to move this to [stats.se] because it's about stats methods rather than programming, but I have a feeling folks over there will tell you it depends on the situation. It's really hard to know how to build a model without any information about what the data is or what the purpose of modeling is – camille Jan 22 '22 at 16:20
  • Frank Harrell has written about this many times. The gist is that variable selection is unstable. Why not use all of your variables? – Dave Jan 23 '22 at 23:19

0 Answers