randomForest vs randomForestSRC discrepancies

Question

There are two popular R packages to build random forests introduced by Breiman (2001): randomForest and randomForestSRC. I am noticing small, yet significant discrepancies in terms of accuracy between the two packages, even when I try to use the same input parameters. I understand we would expect a slightly different random forest, but in example below, randomForestSRC package consistently outperforms the randomForest package. I'm guessing there are other examples where randomForest is superior. Can someone please explain why these packages provide different predictions? Is there a way to generate a random forest for both packages using the same methodology?

In the example, there's no missing data, all values are distinct, mtry=1, and trees are grown until nodesplit=5. I believe the same bootstrap approach and split rule is used too. Increasing ntree or number of observations in the simulated dataset does not change the relative difference between the two packages.

library(randomForest)
library(randomForestSRC)

set.seed(130948) #Other seeds give similar comparative results
x1<-runif(1000)
y<-rnorm(1000,mean=x1,sd=.3)
data<-data.frame(x1=x1,y=y)

#Compare MSE using OOB samples based on output
(modRF<-randomForest(y~x1,data=data,ntree=500,nodesize=5))
(modRFSRC<-rfsrc(y~x1,data=data,ntree=500,nodesize=5))

#Compare MSE using a test sample
x1new<-runif(10000)
ynew<-rnorm(10000,mean=x1new,sd=.3)
newdata<-data.frame(x1=x1new,y=ynew)

mean((predict(modRF,newdata=newdata)-newdata$y)^2) #MSE using randomForest
    mean((predict(modRFSRC,newdata=newdata)$predicted-newdata$y)^2) #MSE using randomForestSRC

score 7 · Accepted Answer · answered Aug 05 '16 at 12:28

One of the causes of the packages producing different results is the way nodesize is implemented internally. In randomForest, the value appears to be a strict lower bound. In randomForestSRC, while we (unfortunately) don't document the subtlety, we will not attempt to split a node without at least 2 * nodesize replicates in a node. But when we do, it can result in one daughter < nodesize, and the other daughter >= nodesize. What we can say is that "on average" our terminal nodes across the forest will be of size = nodesize. The result is that we can grow slightly better trees than RF with the "same" setting.

If you set nodesize = 1 to avoid this issue, and accommodate for Monte Carlo effects by growing multiple forests with multiple simulations you will find that the MSE for both packages are coincident.

Wayne · Answer 2 · 2016-01-17T17:23:46.667

I like randomForestSRC a lot. It has some really nice plots and diagnostics.

There are a lot of choices of how to implement the algorithm. For example, looking at the help pages, rfsrc has a splitrule, where "The default rule is weighted mean-squared error splitting mse". How does randomForest do it? They each can control the size of trees, but do it in two different ways: one by specifying the maximum number of leaves, the other by specifying the maximum depth. There are dozens of such choices, some of which are exposed as parameters (the two I mentioned), but many which are not.

So I can't tell you exactly why they produce different results, but a random forest is not a closed-form like, say, OLS, nor is it an optimization (MLE) procedure. It's more algorithmic in nature so there are no mathematical reasons that would force them to agree.

Why do you ask? Are you simply looking for an explanation? Your question about forcing the same answer makes me think you're doing something like speed benchmarking and want to compare speeds for reaching exactly the same answer. Or something that might better be explicit.

EDIT: OK, per your comment, I'd recommend changing your question to be about the paradox and documenting what you did and your results.

My guess is, to the extent that an RV actually helps results, it's because adding an RV might dilute the effect of RF's preferring to split continuous or categorical variables with lots of levels. If so, trying RVs with randomForestSRC with split of zero (usual, deterministic) and non-zero (random splits) might illustrate this.

I'm comparing several RF algorithms. I read a strange paradox where adding a RV independent of the response can increase prediction accuracy. For example, if you add x2 — Peter Calhoun, Jan 16 '16 at 18:31
Which package improved more? My guess is `randomForest`, but mainly because it's older. I'd suggest putting your actual question out there would be more productive. The paradox, what you did with each package, what results you had, etc. You may get a direct answer about the paradox, or perhaps just a discussion on why package A seems to have ~3% less accuracy than B, but adding an RV resulted in 5% higher accuracy than B, which didn't change. — Wayne, Jan 17 '16 at 17:18
@PeterCalhoun: (See my suggestion to make your question explicit, above. Forgot to @ you.) — Wayne, Jan 17 '16 at 17:23
Thank you for your comments. However, the paper, "Predictor Augmentation in Random Forests" by Xu already explains the paradox. I am trying to figure out why `randomForest` and `randomForestSRC` give different predictions when both claim they are both implementing the original RF method. If one can explain why the packages give different predictions for the example provided, this would also explain differences in the paradox. I apologize if my initial question wasn't clear; I am simply trying to understand which part of the two algorithms leads to different results. — Peter Calhoun, Jan 18 '16 at 19:17

randomForest vs randomForestSRC discrepancies

2 Answers2

Linked