
I am getting completely different results from cforest and randomForest with regards to variable importance (mean decrease in accuracy):

require(randomForest)
require(party)
require(RCurl)

x <- getURL("https://raw.githubusercontent.com/mguzmann/r-funcs/master/data/tempData.csv")

tempData <- read.csv(text = x, quote="\"", row.names=1)

tempData$resp = as.factor(as.numeric(tempData$resp))
tempData$a = as.factor(as.numeric(tempData$a))
tempData$b = as.factor(as.numeric(tempData$b))
tempData$c = as.factor(as.numeric(tempData$c))
tempData$d = as.factor(as.numeric(tempData$d))

tempData.rf = randomForest(resp ~., tempData, importance=T)
importance(tempData.rf)

(simplified output)

MeanDecreaseAccuracy

a 10.18511
b 15.88859
c 27.13357
d 15.36184

tempData.cf = cforest(resp ~., tempData,
        controls=cforest_unbiased(ntree=500, mtry=4))
varimp(tempData.cf)

(simplified output)

a 0.02219259
b 0.02903704
c 0.06500741
d 0.03765926

How should I compare these numbers and interpret their discrepancies? Why does this happen?

mguzmann

2 Answers


The two importance measures actually agree on the ranking for a and c. There is a good tutorial on variable importance here, written by some of the people behind cforest.

Also, I think the absolute values differ because, if you check the documentation of the randomForest package, the default is:

importance(x, type=NULL, class=NULL, scale=TRUE, ...)

This means that by default the importance of feature $\mathbf{x}_j$ is standardized by its estimated standard error: $\frac{\mbox{VI}(\mathbf{x}_j)}{\hat{\sigma}/\sqrt{\mbox{ntree}}}$

Where

$$\mbox{VI}(\mathbf{x}_j) = \frac{\sum_{t}^{\mbox{ntree}}\mbox{VI}^{(t)}(\mathbf{x}_j)}{\mbox{ntree}}$$
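To put the two packages on a more comparable scale, you can turn this standardization off. A minimal sketch, assuming tempData from the question is already loaded:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(resp ~ ., tempData, importance = TRUE)

# type = 1 selects the permutation importance (mean decrease in accuracy);
# scale = FALSE skips dividing by the standard error, giving raw values
# on a scale closer to what cforest's varimp() reports.
importance(rf, type = 1, scale = FALSE)
```

The unscaled numbers will still not match varimp() exactly (different forests, different resampling schemes), but they are at least in the same units.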

Simone

Traditional random forests use the "Gini gain" splitting criterion when assessing variable importance, which is biased towards factor variables with many levels/categories. In contrast, the cforest function builds random forests not from CART trees but from unbiased classification trees based on conditional inference, which gives much more robust results when multi-level factor variables are involved, particularly when the function is used with subsampling without replacement. Here is a good paper that explains the differences in more depth:

Bias in random forest variable importance measures: Illustrations, sources and a solution. Strobl et al. (2007)
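The bias is easy to demonstrate on simulated data. In the sketch below (my own illustrative example, not from the paper), both predictors are pure noise, but the one with more levels receives a much larger Gini importance simply because it offers more candidate split points:

```r
library(randomForest)

set.seed(1)
n  <- 500
y  <- factor(sample(c("A", "B"), n, replace = TRUE))    # random response
x1 <- factor(sample(letters[1:2],  n, replace = TRUE))  # noise, 2 levels
x2 <- factor(sample(letters[1:20], n, replace = TRUE))  # noise, 20 levels

rf <- randomForest(y ~ x1 + x2, importance = TRUE)

# type = 2 is the mean decrease in Gini; x2 typically ranks far above x1
# even though neither predictor carries any signal.
importance(rf, type = 2)
```

Running the same comparison with cforest(..., controls = cforest_unbiased()) and varimp() should give both noise variables importances near zero.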

Kasia Kulma