
Until now I have used the following flow for training a random forest model:

create 10 folds of the data.
for each fold i:
    - use the ith fold as validation data
    - use the remaining 9 folds as training data
    - apply normalization to the training and validation data
    - # apply feature selection on the training data
    - # select the same features from the validation data
    - train a random forest on the training data
    - predict values for the validation data
combine all predictions. (A sketch of this loop in R follows.)
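A minimal R sketch of that flow, assuming a data frame df with a numeric outcome column y and numeric predictors (the names and fold setup are placeholders):

    library(caret)
    library(randomForest)

    set.seed(1)
    folds <- createFolds(df$y, k = 10)  # list of validation-row indices
    preds <- rep(NA_real_, nrow(df))

    for (idx in folds) {
      x_cols  <- setdiff(names(df), "y")
      train_x <- df[-idx, x_cols]
      valid_x <- df[ idx, x_cols]

      # fit the normalization on the training fold only, then apply it
      # to both sets so the validation data never leaks into the fit
      pp      <- preProcess(train_x, method = c("center", "scale"))
      train_x <- predict(pp, train_x)
      valid_x <- predict(pp, valid_x)

      # feature selection would slot in here, on train_x only

      fit        <- randomForest(x = train_x, y = df$y[-idx])
      preds[idx] <- predict(fit, valid_x)
    }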

Now I want to do feature selection using the varImp() function. I am confused because varImp() is said to itself train a model on the training data in order to find the best set of features.

How should I use varImp() to get the important features (say, using partial least squares) and then train the model on the training data again?
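For concreteness, I imagine something like the following per fold, in place of the commented feature-selection step above (a sketch only: ncomp = 3 and the top-20 cutoff are arbitrary placeholders, and caret's "pls" method needs the pls package installed). The model that varImp() fits here only ever sees the training fold, so the validation fold stays untouched:

    # rank predictors with a PLS model fit on the training fold only
    pls_fit <- train(x = train_x, y = df$y[-idx], method = "pls",
                     trControl = trainControl(method = "none"),
                     tuneGrid  = data.frame(ncomp = 3))
    imp  <- varImp(pls_fit)$importance
    keep <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]

    # keep the same features in both sets, then train the forest
    train_x <- train_x[, keep, drop = FALSE]
    valid_x <- valid_x[, keep, drop = FALSE]
    fit     <- randomForest(x = train_x, y = df$y[-idx])

Is that the right way to slot varImp() into the loop?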

exAres
  • Inside your loop, step/line 4 says "apply feature selection", how did you select these features? – horaceT Sep 12 '17 at 01:13
  • There's no need to use feature scaling with Random Forest. https://stats.stackexchange.com/questions/255765/does-random-forest-need-input-variables-to-be-scaled-or-centered If you're using random forest, you might as well apply a random forest-based feature selection method such as the Boruta algorithm. https://stats.stackexchange.com/questions/2350/why-does-the-random-forest-oob-estimate-of-error-improve-when-the-number-of-feat/2359#2359 – Sycorax Aug 13 '18 at 00:11

2 Answers


This sounds a lot like recursive feature elimination. See the caret help page for feature selection.
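A minimal sketch of caret's rfe() with random-forest importance (rfFuncs); the candidate subset sizes are illustrative, and train_x and df$y[-idx] refer to the training fold from the question's loop:

    library(caret)

    ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
    rfe_fit <- rfe(x = train_x, y = df$y[-idx],
                   sizes = c(5, 10, 20),  # candidate subset sizes
                   rfeControl = ctrl)
    predictors(rfe_fit)                   # the selected feature set

Run inside each outer fold, this amounts to nested cross-validation, which is what keeps the selection honest.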

topepo

I'd suggest removing the least important (lowest varImp) features using this criterion, then re-running the training and validation steps. In some cases, but not always, this will improve your results. Note that normalization is generally not needed for this approach and may in fact result in a loss of valuable signal.
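A sketch of that idea, assuming a regression forest on the training fold from the question's loop (the bottom-quartile cutoff is an arbitrary placeholder):

    library(randomForest)

    fit  <- randomForest(x = train_x, y = df$y[-idx], importance = TRUE)
    imp  <- importance(fit)[, "%IncMSE"]           # permutation importance
    keep <- names(imp)[imp > quantile(imp, 0.25)]  # drop the bottom quartile
    fit2 <- randomForest(x = train_x[, keep, drop = FALSE], y = df$y[-idx])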

katya