How to change threshold for classification in R randomForests?

Question

All the Species Distribution Modelling literature suggests that when predicting the presence/absence of a species using a model that outputs probabilities (e.g., random forests), the choice of the threshold probability used to actually classify a species as present or absent is important, and one should not always rely on the default of 0.5. I need some help with this! Here is my code:

library(randomForest)
library(PresenceAbsence)

#build model (Y must be a factor for randomForest to do classification)
RFfit <- randomForest(Y ~ x1 + x2 + x3 + x4 + x5, data = mydata, mtry = 2, ntree = 500)

#eventually I will apply this to (predict for) new data, but first I predict back to the training data to compare observed vs. predicted
RFpred <- predict(RFfit, mydata, type = "prob")

#put the observed vs. predicted in the same dataframe
ObsPred <- data.frame(cbind(mydata), Predicted=RFpred)

#create auc.roc plot
auc.roc.plot(ObsPred, threshold = 10, xlab="1-Specificity (false positives)",
  ylab="Sensitivity (true positives)", main="ROC plot", color=TRUE,
  find.auc=TRUE, opt.thresholds=TRUE, opt.methods=9)

From this I determined that the threshold I would like to use for classifying presence from the predicted probabilities is 0.7, not the default of 0.5. I don't totally understand what to do with this information. Do I simply use this threshold when creating a map of my output? I could easily create a mapped output with continuous probabilities then simply reclassify those with values greater than 0.7 as present, and those < 0.7 as absent.

Or, do I want to take this information and re-run my randomForests modeling, using the cut-off parameter? What exactly is the cut-off parameter doing? Does it change the resultant vote? (currently says it is "majority"). How do I use this cut-off parameter? I don't understand the documentation! Thanks!

I would say this may belong here: The issue of (1) probability estimates from RF, (2) whether you can impose a cost function on the model or have to build it into the model, and (3) how to implement cost functions in RF are recurring issues that are not simply related to programming. — charles, Aug 18 '14 at 22:13

floodking · Answer 1 · 2014-10-03T13:33:13.590

8

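# Set the threshold (cutoff) to 0.7
cutoff <- 0.7

# With type = "prob", RFpred has one column per class.
# Assumption: presence is the second factor level of Y, so its
# probabilities are in the second column.
p_present <- RFpred[, 2]

# Values lower than the cutoff are classified as 0 (absent);
# values greater than or equal to the cutoff are classified as 1 (present)
RFclass <- ifelse(p_present >= cutoff, 1, 0)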
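
For comparison with the cutoff argument the question asks about: the randomForest documentation defines the winning class as the one with the largest ratio of vote proportion to cutoff. A minimal base-R sketch, using made-up probabilities and assuming classes labelled "0" (absent) and "1" (present), suggests that for two classes cutoff = c(0.3, 0.7) gives the same labels as thresholding the presence probability at 0.7:

```r
# Hypothetical class probabilities, shaped like predict(RFfit, type = "prob"):
# one row per site, one column per class ("0" = absent, "1" = present).
prob <- cbind("0" = c(0.90, 0.40, 0.25, 0.35),
              "1" = c(0.10, 0.60, 0.75, 0.65))

threshold <- 0.7

# Direct thresholding of the presence column only
pred_threshold <- ifelse(prob[, "1"] > threshold, "1", "0")

# The cutoff rule: divide each class's vote proportion by its cutoff,
# then pick the class with the largest ratio.
cutoff <- c("0" = 1 - threshold, "1" = threshold)
ratio <- sweep(prob, 2, cutoff, "/")
pred_cutoff <- colnames(prob)[max.col(ratio, ties.method = "first")]

print(data.frame(p_present = prob[, "1"], pred_threshold, pred_cutoff))
```

With a fitted forest, the same effect can be requested at prediction time, e.g. predict(RFfit, newdata, cutoff = c(0.3, 0.7)), where the order of the cutoff values follows the factor levels of Y. So re-running the model is not required just to change the threshold; reclassifying the mapped probabilities should give the same result.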

edited Oct 03 '14 at 13:33

answered Sep 25 '14 at 00:20

floodking


Could you expand on your answer a little bit? At the very least it'd be useful to annotate your code. – Patrick Coulombe Sep 25 '14 at 01:08

FWIW, I think this is perfectly sufficient. – Sycorax Oct 03 '14 at 18:25
This answer is perfectly sound. I agree. – Seanosapien Oct 19 '17 at 12:57

score 7 · Answer 2 · edited Jul 27 '18 at 05:52

Sorry you haven't gotten any attempts at answers. I would recommend Max Kuhn's book for coverage of this issue. This is a fairly broad issue, so I'll just add some bits:

1. ROC curves are popular, but they only make sense if you're trying to understand the trade-off between the costs of false negative and false positive results. If cost(FN) = cost(FP), I'm not sure they make sense. The c-statistic and other derived measures do still have their uses. If you want to maximize accuracy, just tune your model for that (the caret package makes this easy); don't go making an ROC curve.
2. Everyone uses the probabilities derived from RF models. I think some thought should be given to doing this: these are not probabilistic models, and they aren't built to do this. It often works. At a minimum, I would produce a validation plot of the RF probabilities on new data if I were really interested in the probabilities.
3. The simplest way would be to "simply reclassify those with values greater than 0.7 as present, and those < 0.7 as absent".
4. If cost(FN) does not equal cost(FP), then you need to make the RF cost-sensitive. R does not make this easy. The weighting function in the randomForest package doesn't work well. The best option is to play around with the sampling: undersample the majority class to get the cost function you want. But the relationship between sample ratio and cost isn't direct, so you might want to stick with (3).
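
The validation plot suggested above can be sketched in base R by binning the predicted probabilities and comparing each bin's mean prediction with the observed presence rate. The data here are simulated purely for illustration: with a real model, p_hat would be the predicted presence probabilities on held-out data and y the observed 0/1 outcomes.

```r
set.seed(1)

# Simulated stand-ins (assumptions, not real model output):
# p_hat = predicted presence probabilities on held-out data,
# y     = observed presence (1) / absence (0).
n     <- 2000
p_hat <- runif(n)
y     <- rbinom(n, size = 1, prob = p_hat)  # well calibrated by construction

# Bin the predictions into ten equal-width bins
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Per-bin mean prediction, observed presence rate, and sample size
calib <- data.frame(
  mean_predicted = as.vector(tapply(p_hat, bins, mean)),
  observed_rate  = as.vector(tapply(y, bins, mean)),
  n              = as.vector(table(bins))
)
print(calib)

# Points close to the dashed 1:1 line suggest the probabilities
# can be read more or less at face value.
plot(calib$mean_predicted, calib$observed_rate,
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed presence rate")
abline(0, 1, lty = 2)
```

If the points bend away from the 1:1 line, a threshold such as 0.7 should be read as a cut on the model's score rather than as a literal 70% probability of presence.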

Update regarding class weights, from Andy Liaw:
"The current "classwt" option in the randomForest package has been there since the beginning, and is different from how the official Fortran code (version 4 and later) implements class weights. It simply account for the class weights in the Gini index calculation when splitting nodes, exactly as how a single CART tree is done when given class weights. Prof. Breiman came up with the newer class weighting scheme implemented in the newer version of his Fortran code after we found that simply using the weights in the Gini index didn't seem to help much in extremely unbalanced data (say 1:100 or worse). If using weighted Gini helps in your situation, by all means do it. I can only say that in the past it didn't give us the result we were expecting."

Could you elaborate on subpoint (4), and why the weighting argument doesn't work? – Sycorax, Oct 03 '14 at 18:26
My understanding was that it's appropriately implemented in the Fortran code (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm) but not in the R package. This is discussed here: (https://stat.ethz.ch/pipermail/r-help/2011-September/289769.html) and centers around needing to use the weights at all stages of tree building, not just in the Gini split. So the current R implementation, which only uses weighting at the split, doesn't work very well. – charles, Oct 04 '14 at 02:40
