I'm trying to use the randomForest algorithm in R. When I apply the algorithm to my data, I get the following result:

library(randomForest)

x = data[,1:8]
y = data[,9]

model <- randomForest(y ~ ., x)
model
Call:
 randomForest(formula = y ~ ., data = x) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 23.7%
Confusion matrix:
    0   1 class.error
0 430  70   0.1400000
1 112 156   0.4179104

This agrees with the error I get from the model's own predictions:

pred <- predict(model, type = "class")
table(pred, y)
1-mean(pred==y)
  y
pred   0   1
   0 430 112
   1  70 156
[1] 0.2369792

Here the error rate is the same as the OOB estimate above. However, when I predict again and explicitly pass the original x values as new data, the result is different:

pred2 <- predict(model, newdata = x, type = "class")
table(pred2, y)
1-mean(pred2==y)
     y
pred2   0   1
    0 500   0
    1   0 268
[1] 0

I've already tried the same procedure with the glm function, and there both predictions gave the same result. What is the difference between the two predictions above?

  • The way you used the formula argument in the RF model is weird: in `randomForest(y ~ ., x)`, y is a vector and all the predictors are in x. When you provide a formula, all variables are expected to be in the same data frame. Use `randomForest(y=y,x=x)`. – user2974951 Jul 30 '19 at 13:47
  • I changed it the way you said, but it didn't change a thing. – lpedrassoli Jul 31 '19 at 08:10
  • Can you post a chunk of your data or a reproducible example? – user2974951 Jul 31 '19 at 09:23

1 Answer

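In the first case, `predict(model, type = "class")`, no `newdata` is supplied, so we look at the out-of-bag (OOB) predictions: each observation is predicted using only the trees that did not have it in their bootstrap sample. In the second case, `predict(model, newdata = x, type = "class")`, the training data are treated as new data, so every tree votes, including the trees that were grown on those same observations. This second option artificially inflates the goodness-of-fit of the RF model, which is why the error drops to 0. CV.SE has a very relevant thread: "What measure of training error to report for Random Forests?" I would suggest reading through it.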
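
To see the two behaviours side by side, here is a minimal sketch. It assumes the built-in `iris` data rather than the data from the question, and the seed and object names (`fit`, `pred_oob`, `pred_train`) are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the run is reproducible
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

# No newdata: predict() returns the OOB predictions stored in the model,
# i.e. each row is voted on only by trees that did not train on it.
pred_oob <- predict(fit)
oob_error <- 1 - mean(pred_oob == iris$Species)

# newdata = training rows: all trees vote, including those grown on the row,
# so this is a resubstitution (in-sample) error, not an honest estimate.
pred_train <- predict(fit, newdata = iris)
train_error <- 1 - mean(pred_train == iris$Species)

print(c(oob = oob_error, resubstitution = train_error))
```

The OOB figure is the one to report as an estimate of generalisation error; the resubstitution error will usually be much lower, simply because the trees have already seen those rows.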

usεr11852