This question seems related, but the consensus there was that the issue had to do with scaling the data, which I do prior to training, so I don't think that's the problem:
I've uploaded a sample data set, and here is how I generated my model:
library(randomForest)
library(caret)
library(ggplot2)

data <- read.csv("http://pastebin.com/raw.php?i=mE5JL1dm")

# split predictors and response
data_pred <- data[, 1:(ncol(data) - 1)]
data_resp <- as.factor(data$y)

# center and scale the predictors
data_trans <- preProcess(data_pred, method = c("center", "scale"))
data_pred_scale <- predict(data_trans, data_pred)

# repeated 90/10 train/test splits, keeping the hold-out predictions
trControl <- trainControl(method = "LGOCV", p = 0.9, savePredictions = TRUE)

set.seed(123)
model <- train(x = data_pred_scale, y = data_resp,
               method = "rf", scale = FALSE,
               trControl = trControl)
Here's what caret reports as the model performance:
> model
Random Forest

516 samples
 11 predictors
  5 classes: '0', '0.5', '1', '1.5', '2'

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 0.9%)

Summary of sample sizes: 468, 468, 468, 468, 468, 468, ...

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
   2    0.747     0.663  0.0643       0.0853
   6    0.76      0.68   0.0507       0.068
  11    0.758     0.678  0.0574       0.0763

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.
In my "real" model, I have a training set and a hold-out set, and I create two sets of plots showing the model's predictions for each set against the corresponding true observations. That's when I noticed something that seemed odd to me.
# data set of model predictions on training data vs. actual observations
results <- data.frame(pred = predict(model, data_pred_scale),
                      obs = data_resp)
table(results)
     obs
pred    0 0.5   1 1.5   2
  0   148   0   0   0   0
  0.5   0 132   0   0   0
  1     0   0 139   0   0
  1.5   0   0   0  38   0
  2     0   0   0   0  59
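As a quick numeric check of the same thing (using the results data frame built above), the fraction of training rows re-predicted correctly comes out to exactly 1:

# fraction of training rows where the re-prediction matches the observation
mean(results$pred == results$obs)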
And here's a plot confirming 100% accuracy on the training set:
p <- ggplot(results, aes(x = pred, y = obs))
p <- p + geom_jitter(position = position_jitter(width = 0.25, height = 0.25))
p
But if I look at the saved tuning predictions, subsetting only those where mtry = 6 (what caret reports as the final model), I don't get anywhere near that performance:
model_resamples <- model$pred[model$pred$mtry == 6, c("pred", "obs")]
table(model_resamples)
     obs
pred    0 0.5   1 1.5   2
  0   296  69   5   0   0
  0.5  51 228  48   0   0
  1     3  28 255  24   9
  1.5   0   0  16  32  15
  2     0   0   1  19 101
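Aggregating those saved hold-out predictions gives roughly the accuracy caret reports rather than anything near 100% (from the table, 912 of the 1200 hold-out predictions land on the diagonal, i.e. 0.76):

# overall accuracy across the saved hold-out predictions for mtry = 6
mean(model_resamples$pred == model_resamples$obs)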
And the same sort of plot:
p <- ggplot(model_resamples, aes(x = pred, y = obs))
p <- p + geom_jitter(position = position_jitter(width = 0.25, height = 0.25))
p
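For reference, the per-split accuracy can also be pulled out of model$pred (caret keeps a Resample column when savePredictions = TRUE); the mean and spread of these should line up with the 0.76 accuracy and ~0.05 Accuracy SD in the model summary, not with the perfect re-prediction:

# accuracy within each of the 25 train/test splits, mtry = 6 only
held_out <- model$pred[model$pred$mtry == 6, ]
acc_per_split <- tapply(held_out$pred == held_out$obs, held_out$Resample, mean)
mean(acc_per_split)  # should sit near the reported 0.76
sd(acc_per_split)    # and near the reported Accuracy SD of ~0.05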
Is this just a case of over-fitting, where holding out 10% of the data leads to a roughly 25% drop in performance, yet the final model trained with the same parameters on all rows can re-predict its own training data perfectly? That seems unlikely, but it's the only explanation coming to mind at the moment.
I just want to make sure there's nothing wrong with my training or prediction code that's creating a problem where there shouldn't be one.
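One cross-check I can think of (assuming model$finalModel is the underlying randomForest fit, which is how I understand caret stores it) is the out-of-bag confusion matrix; if the resampled ~76% is the honest estimate, the OOB error rates should tell a similar story rather than matching the perfect re-prediction:

# OOB confusion matrix of the final randomForest fit;
# class errors should be in the same ballpark as the ~24% resampled error
model$finalModel$confusion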
Note: I created the tables/plots prior to adding set.seed() to the model training code above. The exact tables and plots may differ slightly on re-running, but the general result is the same (perfect re-prediction vs. the ~77% accuracy reported by the model). It didn't seem worth re-doing the results/plots above, so I left them.