Much of the Species Distribution Modelling literature suggests that when predicting the presence/absence of a species with a model that outputs probabilities (e.g., random forests), the choice of the threshold probability used to classify a prediction as presence or absence is important, and that one should not always rely on the default of 0.5. I need some help with this! Here is my code:
library(randomForest)
library(PresenceAbsence)
#build model (Y must be a factor for classification; the arguments are spelled mtry and ntree)
RFfit <- randomForest(Y ~ x1 + x2 + x3 + x4 + x5, data=mydata, mtry = 2, ntree=500)
#eventually I will apply this to (predict for) new data, but first I predict back to the training data to compare observed vs. predicted
RFpred <- predict(RFfit, mydata, type = "prob")
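#put the observed vs. predicted in the layout PresenceAbsence expects: ID, observed (0/1), predicted probability
#(assumes Y is a factor with levels "0" and "1", so the presence column of RFpred is named "1")
ObsPred <- data.frame(ID = seq_len(nrow(mydata)), Observed = as.numeric(as.character(mydata$Y)), Predicted = RFpred[, "1"])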
#create auc.roc plot
auc.roc.plot(ObsPred, threshold = 10, xlab="1-Specificity (false positives)",
ylab="Sensitivity (true positives)", main="ROC plot", color=TRUE,
find.auc=TRUE, opt.thresholds=TRUE, opt.methods=9)
From this I determined that the threshold I would like to use for classifying presence from the predicted probabilities is 0.7, not the default of 0.5. I don't totally understand what to do with this information. Do I simply use this threshold when creating a map of my output? I could easily create a mapped output with continuous probabilities then simply reclassify those with values greater than 0.7 as present, and those < 0.7 as absent.
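To make the first option concrete, here is a minimal sketch of the reclassification I have in mind. The probabilities are made-up values standing in for the presence column of RFpred, and the names probs, threshold, and classified are just placeholders. I have also assumed that a value exactly equal to the threshold counts as present.

```r
# hypothetical predicted presence probabilities (stand-in for the presence column of RFpred)
probs <- c(0.12, 0.55, 0.70, 0.83, 0.69)

# threshold chosen from the ROC plot
threshold <- 0.7

# reclassify: 1 = present if probability >= threshold, otherwise 0 = absent
classified <- ifelse(probs >= threshold, 1, 0)

print(classified)
```

The same comparison could presumably be applied cell by cell to a raster of continuous probabilities to produce the presence/absence map.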
Or do I want to take this information and re-run my randomForest model using the cutoff parameter? What exactly does the cutoff parameter do? Does it change how the resulting vote is decided? (Currently it says the vote is "majority".) How do I use the cutoff parameter? I don't understand the documentation! Thanks!