I'm newish to R and new to Cross Validated. I have a question about the predict method for caret "train" objects.
I'm fitting a randomForest model with the caret package and trying to produce some simple ROC curves. My understanding was that calling predict.train() on a train object and predict.randomForest() on its $finalModel component should produce the same results. However, the results are very different: in the example below, accuracy is 0.992 for the predict.train() values and 0.438 for the predict.randomForest() values.
This looks similar to two existing posts: Whether preprocessing is needed before prediction using FinalModel of RandomForest with caret package? (but I don't do any preprocessing) and Confusion between caret randomForest predict() results and reported model performance (but the difference here seems far too large to be explained by different seeds). I sanity-check both points after the code below.
Here is some reproducible code:
library(caret)

set.seed(42)                      # make the fit reproducible

# data.frame() flattens the built-in Titanic contingency table
# (columns: Class, Sex, Age, Survived, Freq)
Titanic <- data.frame(Titanic)

mc <- trainControl(method = 'boot', classProbs = TRUE,
                   returnResamp = 'final', summaryFunction = defaultSummary)

# column 4 is the outcome (Survived), so x is everything else
Titanicmodel <- train(x = Titanic[, -4], y = Titanic[, 'Survived'],
                      method = 'rf', trControl = mc, metric = 'Accuracy')

# caret predictions (no newdata supplied)
pred_train <- predict(Titanicmodel, type = 'raw')
prob_train <- predict(Titanicmodel, type = 'prob')
confusion_train <- confusionMatrix(pred_train, Titanic[, 'Survived'])
confusion_train
plot(pROC::roc(Titanic[, 'Survived'], prob_train[, 'Yes']))

# randomForest predictions straight from the finalModel (again, no newdata)
pred_final <- predict(Titanicmodel$finalModel, type = 'response')
prob_final <- predict(Titanicmodel$finalModel, type = 'prob')
confusion_final <- confusionMatrix(pred_final, Titanic[, 'Survived'])
confusion_final
plot(pROC::roc(Titanic[, 'Survived'], prob_final[, 'Yes']))
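For what it's worth, here are the two sanity checks mentioned above (a minimal sketch, run after the code; I'm assuming the finalModel expects the same predictor columns that train() was given):

# no preProcess argument was supplied above, so this should print NULL
Titanicmodel$preProcess

# hand the training predictors to the finalModel explicitly instead of
# relying on predict()'s behaviour when newdata is missing
pred_explicit <- predict(Titanicmodel$finalModel, newdata = Titanic[, -4],
                         type = 'response')
mean(pred_explicit == pred_train)   # proportion agreeing with the caret predictions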
My confusion probably has to do with one of the specific arguments I'm passing to train() or trainControl(), but I can't tell which one.
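In case the seed explanation from the second linked post were somehow in play, my understanding is that the resampling seeds can be pinned down via trainControl()'s seeds argument. A sketch, assuming the default 25 bootstrap resamples and the default grid of 3 mtry values for method = 'rf':

# one integer vector per resample (length = number of tuning combinations),
# plus a single integer for the final model fit
n_resamples <- 25                                  # default for method = 'boot'
seeds <- vector('list', n_resamples + 1)
for (i in seq_len(n_resamples)) seeds[[i]] <- sample.int(1e6, 3)  # 3 mtry values
seeds[[n_resamples + 1]] <- sample.int(1e6, 1)
mc_seeded <- trainControl(method = 'boot', classProbs = TRUE,
                          returnResamp = 'final',
                          summaryFunction = defaultSummary, seeds = seeds)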
Please let me know if there is a post that addresses this that I've missed.