
I am getting completely different results from cforest and randomForest with regards to variable importance (mean decrease in accuracy):

require(randomForest)
require(party)
require(RCurl)

x <- getURL("https://raw.githubusercontent.com/mguzmann/r-funcs/master/data/tempData.csv")

tempData <- read.csv(text = x, quote="\"", row.names=1)

tempData$resp = as.factor(as.numeric(tempData$resp))
tempData$a = as.factor(as.numeric(tempData$a))
tempData$b = as.factor(as.numeric(tempData$b))
tempData$c = as.factor(as.numeric(tempData$c))
tempData$d = as.factor(as.numeric(tempData$d))

tempData.rf = randomForest(resp ~., tempData, importance=T)
importance(tempData.rf)

(simplified output)

MeanDecreaseAccuracy

a 10.18511
b 15.88859
c 27.13357
d 15.36184

tempData.cf = cforest(resp ~., tempData,
        controls=cforest_unbiased(ntree=500, mtry=4))
varimp(tempData.cf)

(simplified output)

a 0.02219259
b 0.02903704
c 0.06500741
d 0.03765926

How should I compare these numbers and interpret their discrepancies? Why does this happen?

mguzmann

2 Answers


The two importance measures actually agree on the ranking for a and c. There is a good tutorial on variable importance here, written by some of the people behind cforest.

Also, I think the absolute values differ because, if you check the documentation of the randomForest package, the default is:

importance(x, type=NULL, class=NULL, scale=TRUE, ...)

This means that by default the importance of feature $\mathbf{x}_j$ is standardized by its estimated standard error: $\frac{\mbox{VI}(\mathbf{x}_j)}{\hat{\sigma}/\sqrt{\mbox{ntree}}}$

Where

$$\mbox{VI}(\mathbf{x}_j) = \frac{\sum_{t}^{\mbox{ntree}}\mbox{VI}^{(t)}(\mathbf{x}_j)}{\mbox{ntree}}$$
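To put the two packages on a more comparable scale, you can turn this standardization off. A minimal sketch, assuming tempData from the question is already loaded:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(resp ~ ., tempData, importance = TRUE)

# type = 1 selects the permutation importance (mean decrease in accuracy);
# scale = FALSE skips dividing by the standard error, giving raw values
# on a scale closer to what cforest's varimp() reports.
importance(rf, type = 1, scale = FALSE)
```

The unscaled numbers will still not match varimp() exactly (different forests, different resampling schemes), but they are at least in the same units.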

Simone

Traditional random forests use the "Gini gain" splitting criterion when assessing variable importance, which is biased towards factor variables with many levels/categories. In contrast, the cforest function builds random forests not from CART trees but from unbiased classification trees based on conditional inference, which gives much more robust results when multi-level factor variables are involved, particularly when the function is used with subsampling without replacement. Here is a good paper that explains the differences in more depth:

Bias in random forest variable importance measures: Illustrations, sources and a solution. Strobl et al. (2007)
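The bias is easy to demonstrate on simulated data. In the sketch below (my own illustrative example, not from the paper), both predictors are pure noise, but the one with more levels receives a much larger Gini importance simply because it offers more candidate split points:

```r
library(randomForest)

set.seed(1)
n  <- 500
y  <- factor(sample(c("A", "B"), n, replace = TRUE))    # random response
x1 <- factor(sample(letters[1:2],  n, replace = TRUE))  # noise, 2 levels
x2 <- factor(sample(letters[1:20], n, replace = TRUE))  # noise, 20 levels

rf <- randomForest(y ~ x1 + x2, importance = TRUE)

# type = 2 is the mean decrease in Gini; x2 typically ranks far above x1
# even though neither predictor carries any signal.
importance(rf, type = 2)
```

Running the same comparison with cforest(..., controls = cforest_unbiased()) and varimp() should give both noise variables importances near zero.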

Kasia Kulma