
I would like to eventually use the PIMP algorithm (Permutation Variable Importance Measure) in order to get p values for the variables' importance. However, the function

          "PIMP"(X, y, rForest, S = 100, parallel = FALSE, ncores=0, seed = 123, ...)

requires rForest, which is an object of class randomForest.

I can carry out the 5-times repeated 10-fold cross-validation fine using caret:

   rf.fit <- train(T2DS ~ ., 
            data = mod_train.new, 
            method = "rf",     
            importance = TRUE, 
            trControl = trainControl(method = "repeatedcv", 
                                     number = 10, 
                                     repeats = 5))

However, I cannot seem to find any examples or documentation on how to implement this using randomForest. The code below is incorrect:

   rf.fit.try <- randomForest(T2DS ~., data=mod_train.new, importance=TRUE, 
      trControl=trainControl(method="repeatedcv", number=10, repeats=5))

Could anybody please suggest how the repeated cross-validation can be done using the randomForest package, or an alternative way to calculate p values for my variable importances following permutation?

Willow9898
  • I don't fully understand your question. So you want to pass a randomForest object into the PIMP algorithm, and you would like this object's parameters to be optimized by CV? – StupidWolf May 22 '20 at 23:23
  • @StupidWolf, from my recent reading, I can see that caret can only be used to fine-tune accuracy (this can't be done via randomForest). However, I am assuming I can input the rf.fit$finalModel into the PIMP algorithm? I can see that the p values generated are more significant using this model from caret versus the randomForest package. – Willow9898 May 22 '20 at 23:34
  • you are more or less correct. It's not fine-tuning. caret tests a series of mtry values for the randomForest model and chooses the best mtry. If you run randomForest and don't specify mtry, the default is ```floor(sqrt(ncol(x)))``` for classification and ```floor(ncol(x)/3)``` for regression – StupidWolf May 22 '20 at 23:37
  • answer is yes, you can plug in the finalModel. it should be ok – StupidWolf May 22 '20 at 23:37
  • @StupidWolf, thanks for your help. I was just wondering (if you're familiar with this) what your opinion is on SMOTE (Synthetic Minority Oversampling Technique, to handle class imbalance in binary classification). I have an unbalanced data set with an excess of controls in an initial training set (69 vs 26). Would you randomly select ~ 29 controls from this so they are balanced, or use SMOTE? – Willow9898 May 22 '20 at 23:43
  • it really depends on your data. https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning. You can just try the sampling first, then weighting the classes differently, and last the more complicated SMOTE – StupidWolf May 22 '20 at 23:46
  • You don't need a more aggressive approach if the simpler one works, right? – StupidWolf May 22 '20 at 23:47
  • Fair point, thank you for the link – Willow9898 May 22 '20 at 23:49
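
As noted in the comments above, caret tunes mtry over a grid rather than relying on randomForest's single default. Below is a minimal sketch of controlling that grid explicitly via train's tuneGrid argument; the mtry values shown are purely illustrative, not recommendations for this data set:

    library(caret)

    # Explicitly specify which mtry values caret should evaluate during
    # the repeated cross-validation (values here are illustrative only)
    rf.fit <- train(T2DS ~ .,
                    data = mod_train.new,
                    method = "rf",
                    importance = TRUE,
                    tuneGrid = expand.grid(mtry = c(2, 4, 8)),
                    trControl = trainControl(method = "repeatedcv",
                                             number = 10,
                                             repeats = 5))

    rf.fit$bestTune   # the mtry value that caret selected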
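
Regarding the class-imbalance discussion in the comments, caret's trainControl() also has a sampling argument that applies down-sampling, up-sampling, or SMOTE inside each resampling iteration rather than once up front. A minimal sketch, assuming a reasonably recent caret version (and, for "smote", that the supporting package caret calls is installed):

    # Re-balance the classes inside each CV fold instead of before training
    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         sampling = "down")   # or "up", "smote"

    rf.fit.bal <- train(T2DS ~ .,
                        data = mod_train.new,
                        method = "rf",
                        importance = TRUE,
                        trControl = ctrl)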

1 Answer


You can always provide PIMP with the final model obtained from training a random forest with caret:

library(vita)
library(caret)
library(mlbench)
data(Sonar)

rf.fit <- train(Class ~ ., data = Sonar, method = "rf", importance = TRUE,
                trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5))

class(rf.fit)
[1] "train"         "train.formula"

class(rf.fit$finalModel)
[1] "randomForest"

res = "PIMP"(Sonar[,-ncol(Sonar)], Sonar$Class, rf.fit$finalModel, S = 10)
StupidWolf