How to choose the best variables?
I'm training regression models with a single tree and with ensemble methods (bagging and random forest) to make a prediction. In the exploration phase I computed Pearson correlations between the response variable and the independent variables.
For example:
- var1 0.97;
- var2 0.97;
- var3 0.95;
- var4 0.95;
- var5 0.76;
- var6 0.72;
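
Roughly, the correlations were computed like this (a minimal sketch with pandas; the file name `data.csv` and the response column name `target` are placeholders):

```python
import pandas as pd

# Hypothetical file and column names: predictors var1..var6 plus a
# response column called "target".
df = pd.read_csv("data.csv")

# Pearson correlation of each independent variable with the response,
# sorted from strongest to weakest.
corr_with_target = df.corr(method="pearson")["target"].drop("target")
print(corr_with_target.sort_values(ascending=False))
```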
In some cases there is very high correlation between the independent variables themselves, but since I'm using tree-based models (including the ensemble methods) I decided not to remove any of them.
But every time I train a model I obtain a very high explained variance. For example, a random forest trained only on the first variable (var1) gives an explained variance of 97.11%.
So my question is: should I discard the first 4 variables? I think they influence the model too much. Is this an overfitting problem?
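
A sketch of that single-variable setup (scikit-learn assumed; I use cross-validation here rather than a single training score, and the file and column names are the same placeholders as above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")        # same hypothetical file as above
X_var1 = df[["var1"]]               # only the strongest predictor
y = df["target"]

rf = RandomForestRegressor(n_estimators=500, random_state=0)

# Cross-validated explained variance, to see whether the ~97 % figure
# also holds on data the model has not seen during training.
scores = cross_val_score(rf, X_var1, y, cv=5, scoring="explained_variance")
print(f"explained variance: {scores.mean():.4f} +/- {scores.std():.4f}")
```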
How to compare different models?
I'm comparing the single tree with the ensemble methods using RMSE (Root Mean Squared Error). Is this good practice?
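
For concreteness, a sketch of the comparison I have in mind (scikit-learn assumed, with cross-validated RMSE; the model settings and column names are placeholders):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")        # same hypothetical file as above
X = df[["var1", "var2", "var3", "var4", "var5", "var6"]]
y = df["target"]

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(n_estimators=200, random_state=0),  # bags decision trees by default
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    # scikit-learn reports RMSE as a negated score, so flip the sign.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```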