23

Is it possible to control the cost of misclassification in the R package randomForest?

In my own work, false negatives (e.g., erroneously missing that a person may have a disease) are far more costly than false positives. The package rpart allows the user to control misclassification costs by specifying a loss matrix that weights the different misclassifications. Does anything similar exist for randomForest? Should I, for instance, use the classwt option to control the Gini criterion?
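For concreteness, a minimal sketch of the rpart mechanism I mean, with placeholder names (outcome, train) and one misclassification direction weighted ten times the other:

library(rpart)

# Asymmetric loss matrix: the two misclassification directions are
# weighted 10 to 1 (see ?rpart for the row/column convention)
lmat <- matrix(c(0, 10, 1, 0), nrow = 2)
fit <- rpart(outcome ~ ., data = train, parms = list(loss = lmat))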

smci
user5944

5 Answers

8

Not really, short of manually building a random forest clone yourself by bagging rpart models.

One option comes from the fact that the output of RF is actually a continuous score rather than a crisp decision, i.e. the fraction of trees that voted for a given class. It can be extracted with predict(rf_model, type="prob") and used to make, for instance, a ROC curve, which will reveal a better threshold than 0.5 (and that threshold can later be incorporated into RF training via the cutoff parameter).
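For instance, a minimal sketch of that workflow, assuming a binary factor outcome disease in a data frame train with levels c("no", "yes"), and the pROC package for the ROC curve (all names here are placeholders):

library(randomForest)
library(pROC)

rf_model <- randomForest(disease ~ ., data = train)

# Out-of-bag vote fractions for the "yes" class, not hard labels
probs <- predict(rf_model, type = "prob")[, "yes"]

# ROC curve over the OOB votes; pick the threshold that trades
# false negatives against false positives as your costs require
roc_obj <- roc(train$disease, probs)
plot(roc_obj)

# Refit with the chosen threshold baked into the voting rule, e.g.
# call it "yes" whenever at least 20% of trees vote for it;
# cutoff follows levels(train$disease), assumed c("no", "yes")
rf_model2 <- randomForest(disease ~ ., data = train, cutoff = c(0.8, 0.2))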

The classwt approach also seems valid, but it does not work very well in practice -- the transition between balanced prediction and trivially predicting the same class regardless of the attributes tends to be too sharp to be usable.

  • MBQ. Many thanks. (i) ROC Curve: In this instance I don't require the ROC curve as I have my own priors on what the cost weighting should be. (ii) `classwt`: Yes, I have found that in practice, and in line with other users, the results are not as expected. (iii) `cutoff`: I'm not clear about how to utilise `cutoff` in this instance and I'd welcome any further advice. – user5944 Jan 04 '13 at 16:28
3

If the variable you are trying to predict is not split 50% class 1 / 50% class 2 (as in most cases), it is recommended that you adjust the cutoff parameter so that it represents the real class proportions seen in the OOB summary.

For example,

randomForest(formula, data = my_data, ntree = 501, cutoff = c(.96, .04))

In this case, the proportion of observations in class 1 is .96, while the proportion in class 2 is .04.

Otherwise, randomForest uses a default cutoff of 0.5 for each of the two classes (more generally, 1/k for k classes).
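A quick way to read those proportions off the data, assuming the response column is train$y (a placeholder name):

# Empirical class proportions, usable as a starting point for cutoff
prop.table(table(train$y))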

mkt
Pablo Casas
3

There are a number of ways of including costs:
(1) Over/under-sampling for each bagged tree (stratified sampling) is the most common method of introducing costs: you intentionally imbalance the dataset (see the sketch after this list).
(2) Weighting. This never works. I think this is emphasized in the documentation. Some claim you just need to weight at all stages, including the Gini splitting and the final voting. If it is going to work at all, it will be a tricky implementation.
(3) The MetaCost function in Weka.
(4) Treating a random forest as a probabilistic classifier and changing the threshold. I like this option the least. Likely due to my lack of knowledge, but even though the algorithm can output probabilities, it doesn't make sense to me to treat them as if this were a probabilistic model.

But I'm sure there are additional approaches.
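As a minimal sketch of option (1), assuming a binary factor outcome y in a data frame train (both placeholder names): the strata and sampsize arguments of randomForest draw a deliberately rebalanced sample for each tree.

library(randomForest)

# Draw 50 cases per class into each bootstrap sample, so every tree
# sees a class mix chosen to reflect the relative misclassification costs
rf_strat <- randomForest(y ~ ., data = train,
                         strata = train$y,
                         sampsize = c(50, 50))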

charles
1

One can try passing a cost matrix to randomForest explicitly via a parms parameter, mirroring the rpart interface:

library(randomForest)
# Off-diagonal entries weight the two misclassification types
# asymmetrically (here 10 to 1)
costMatrix <- matrix(c(0, 10, 1, 0), nrow = 2)
mod_rf <- randomForest(outcome ~ ., data = train, ntree = 1000,
                       parms = list(loss = costMatrix))

Be aware, though, that parms is documented for rpart, not for randomForest, which silently absorbs unrecognised arguments through ..., so verify that adding it actually changes the fit.
Sergey Bushmanov
0

You can incorporate cost sensitivity using the sampsize argument in the randomForest package.

model1 <- randomForest(DependentVariable ~ ., data = my_data, sampsize = c(100, 20))

Vary the figures (100,20) based on the data you have and the assumptions/business rules you are working with.

It takes a bit of trial and error to get a confusion matrix that reflects the costs of classification error. Have a look at Richard Berk's Criminal Justice Forecasts of Risk: A Machine Learning Approach, p. 82.
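For that trial-and-error loop, note that the fitted object keeps its out-of-bag confusion matrix, so each sampsize setting can be checked directly (continuing the example above):

# OOB confusion matrix (counts plus class-wise error rates);
# adjust sampsize and refit until the error mix reflects your costs
model1$confusion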

Tavrock