Why is random forest performing worse than decision tree

Question

I have a data set with 1962 observations and 46 columns. Column 46 is the target with 3 classes 1, 2, 3. 6 of the other columns are nominal variables and the rest are ordinal variables. I have preprocessed them using as follows:

for (i in c(1:4,6,9,46)){
    cw_alldata_known[,i] <- as.factor(cw_alldata_known[,i])
}

for (i in c(5,7,8,10:45)){
  cw_alldata_known[,i] <- as.ordered(cw_alldata_known[,i])
}

Then I divide them 50/50 into training and test sets. I fitted a decision tree using party package of R:

cw.ctree <- ctree(cr~.,data = cw.train)

Then I also fitted a random forest model using randomForest package:

cw.forest <- randomForest(credit.rating ~ ., data=cw.train,ntree=107)

I have tried other ntree values but 107 seems to be the best. The accuracy on the test set of decision tree is around 61%, while random forest is only 56%. I read that random forest is often more robust and reliable. Why doesn't it perform better than decision tree in this case?

1962 observations (*before* splitting) is not a lot. RFs shine when there is a lot of data. We unfortunately won't be able to help you without knowing much more about your data. In the meantime, this may be helpful: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) — Stephan Kolassa, May 24 '18 at 06:28
You should take into account that "nodesize" is default to 1 in the RandomForest function. RandomForests hardly overfit, but with nodesize = 1 and only 1962 observations I would take care of overfitting. — Ferdi, May 24 '18 at 07:11

score 1 · Answer 1 · answered May 24 '18 at 13:16

It sounds as if you conducted only a single iteration of training-set and test-set crossvalidation. This will render your accuracy levels highly unreliable -- especially for the decision tree model. The accuracy of the random forest model, drawing on many decision trees, should be a little more reliable. But still, rather than using one such iteration, you should use many in order to obtain stable estimates of model predictive accuracy. One prominent author typically recommends at least 10,000 such iterations, and though I suspect that that many are not necessary, you'll find other authors who assert that the term "crossvalidation" doesn't even apply to the use of training and test sets unless there are multiple iterations.

Also note that there are other ways to split data besides 50-50 splits. You could look into k-fold procedures in which one might use more than half the data (perhaps 70-90%) for the more "demanding" task of building each model, and the smaller portion for testing it. I would think this would be especially relevant to a situation like yours, where you have so many cells (each reflecting combinations of ordinal and/or nominal variables) that need to be populated in order to establish a basis for prediction. But 45 predictors (with only 1,962 observations) will make that difficult, so you may want to reduce the number of predictors via data reduction or other approaches.

Why is random forest performing worse than decision tree

1 Answers1