I use the caret package to train a randomForest model with 10 x 10-fold repeated cross-validation (10x10CV).

library(caret)
tc <- trainControl(method="repeatedcv", number=10, repeats=10, classProbs=TRUE, savePredictions=TRUE)
RFFit <- train(Defect ~., data=trainingSet, method="rf", trControl=tc, preProc=c("center", "scale"))

After that, I test the random forest on a test set (new data):

RF.testSet$Prediction <- predict(RFFit, newdata=testSet)

The confusion matrix shows that the model isn't that bad.

confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
          Reference
Prediction   0   1
         0 886 179
         1  53 126

               Accuracy : 0.8135
                 95% CI : (0.7907, 0.8348)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 4.369e-07

                  Kappa : 0.4145

I now want to test the `$finalModel`, and I think it should give me the same result, but instead I get:

> RF.testSet$Prediction <- predict(RFFit$finalModel, newdata=RF.testSet)
>  confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 323  66
         1 616 239

               Accuracy : 0.4518          
                 95% CI : (0.4239, 0.4799)
    No Information Rate : 0.7548          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.0793 

What am I missing?

Edit @topepo:

I also trained another randomForest without the `preProc` option and got a different result:

RFFit2 <- train(Defect ~., data=trainingSet, method="rf", trControl=tc)
testSet$Prediction2 <- predict(RFFit2, newdata=testSet)
confusionMatrix(data=testSet$Prediction2, testSet$Defect)

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 878 174
         1  61 131

               Accuracy : 0.8111          
                 95% CI : (0.7882, 0.8325)
    No Information Rate : 0.7548          
    P-Value [Acc > NIR] : 1.252e-06       

                  Kappa : 0.4167     
Glorfindel
Frank
  • In the first instance, you predicted with a train object, which you called `RFFit`; the second time, you predicted using the model object, I guess. So the difference might come from the train object passing other things along and processing your new test data differently than when you predict without it. – doctorate Jan 08 '14 at 17:33
  • For the 2nd `train` model you will get a slightly different result unless you set the random number seed before running it (see `?set.seed`). The accuracy values are 0.8135 and 0.8111, which are pretty close and only due to the randomness of resampling and the model calculations. – topepo Jan 09 '14 at 12:59
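
To illustrate the seed remark in the comment above, here is a minimal sketch. It assumes simulated stand-in data (the `toy` data frame and its `x1`, `x2` predictors are hypothetical, not the asker's `trainingSet`), a fixed `mtry`, and sequential execution with no parallel backend registered. Under those assumptions, calling `set.seed` with the same value before each `train` call should reproduce both the resampling folds and the forest itself.

```r
library(caret)  # the randomForest package must also be installed

# Hypothetical stand-in data, not the asker's trainingSet
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Defect <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "yes", "no"))

ctrl <- trainControl(method = "cv", number = 5)
grid <- data.frame(mtry = 1)  # fixed tuning value, chosen only for illustration

# Same seed before each call: same folds and same random forest draws
set.seed(123)
fit_a <- train(Defect ~ ., data = toy, method = "rf",
               trControl = ctrl, tuneGrid = grid)
set.seed(123)
fit_b <- train(Defect ~ ., data = toy, method = "rf",
               trControl = ctrl, tuneGrid = grid)

# No seed reset here: the resampled accuracy may differ slightly from fit_a
fit_c <- train(Defect ~ ., data = toy, method = "rf",
               trControl = ctrl, tuneGrid = grid)

same_ab <- isTRUE(all.equal(fit_a$results$Accuracy, fit_b$results$Accuracy))
print(same_ab)
print(c(a = fit_a$results$Accuracy, c = fit_c$results$Accuracy))
```

Any gap between `fit_a` and `fit_c` here comes only from randomness, which is the same kind of gap as 0.8135 versus 0.8111 in the question.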

1 Answer

The difference is the pre-processing. `predict.train` automatically centers and scales the new data (since you asked for that), while `predict.randomForest` takes whatever it is given. Since the tree splits are based on the processed values, predictions made on unprocessed data will be off.

Max
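
The point above can be checked on a small example. This is a sketch under assumptions, not the asker's setup: the data are simulated (`toy_train`, `toy_test`, `x1`, `x2` are hypothetical names), all predictors are numeric, and the stored pre-processing is taken from the `preProcess` element of the `train` object. Applying that object by hand before calling the forest should give the same predictions as `predict.train`, while feeding raw data to the forest skips the centering and scaling.

```r
library(caret)  # the randomForest package must also be installed

# Hypothetical stand-in data on a scale far from zero, so scaling matters
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n, mean = 50, sd = 10),
                  x2 = rnorm(n, mean = 100, sd = 20))
toy$Defect <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 5) > 100,
                            "yes", "no"))
toy_train <- toy[1:200, ]
toy_test  <- toy[201:300, ]

fit <- train(Defect ~ ., data = toy_train, method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             preProcess = c("center", "scale"),
             tuneGrid = data.frame(mtry = 1))

# predict.train: applies the stored centering/scaling, then calls the forest
p_train <- predict(fit, newdata = toy_test)

# predict.randomForest on raw data: no pre-processing is applied
p_raw <- predict(fit$finalModel, newdata = toy_test[, c("x1", "x2")])

# Apply the stored preProcess object by hand, then call the forest
test_pp  <- predict(fit$preProcess, toy_test[, c("x1", "x2")])
p_manual <- predict(fit$finalModel, newdata = test_pp)

# Share of test rows where each approach agrees with predict.train
mean(as.character(p_train) == as.character(p_manual))
mean(as.character(p_train) == as.character(p_raw))
```

With factor predictors the formula interface also creates dummy variables, so the manual route would need the same model matrix; calling `predict` on the `train` object avoids having to redo these steps.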

topepo
  • But the `RFFit` object is created with the pre-processing option of `train`, so it should return a centered and scaled object (shouldn't it?). If so, the `$finalModel` should also be scaled and centered. – Frank Jan 09 '14 at 06:30
  • Yes but, according to the code above, you have not applied the centering and scaling to `testSet`. `predict.train` does that but `predict.randomForest` does not. – topepo Jan 09 '14 at 12:55
  • So there is no difference between using `predict(RFFit$finalModel, testSet)` and `predict(RFFit, testSet)` on the same testSet? – Frank Jan 10 '14 at 14:05
  • `predict(RFFit$finalModel, testSet)` and `predict(RFFit, testSet)` will be different if you use the `preProc` option in `train`. If you do not, both work on the same data. In other words, any pre-processing that you ask for is done to the training set prior to running `randomForest`. It is also applied to any data that you predict on (using `predict(RFFit, testSet)`). If you use the `finalModel` object, you are using `predict.randomForest` instead of `predict.train`, and none of the pre-processing is done before prediction. – topepo Jan 14 '14 at 23:14