
I'm trying to build a Random Forest classifier in R that will identify people with a diagnosis. In the ecological setting (a medical examination) there will probably be roughly a 50%/50% split between diagnosed and non-diagnosed, but my training set comes from the general population, so I have roughly 1400 non-diagnosed vs. 180 diagnosed cases.

If I sample 180 cases from the non-diagnosed group (to match the 180 diagnosed), I get roughly 90% accuracy in both groups (fair, but I would like a bit better). If I use the entire dataset, I get 98% accuracy for the nonclinical group and only 60% for the clinical group (useless).
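
For illustration, here is a rough sketch of that downsampling step, assuming a data frame dataset with a Diagnosis column as in the call further down; the level labels "clinical" and "nonclinical" and the seed are placeholders of mine, not from the original data:

library(randomForest)

set.seed(1)
clin.idx    <- which(dataset$Diagnosis == "clinical")     #~180 diagnosed cases
nonclin.idx <- which(dataset$Diagnosis == "nonclinical")  #~1400 non-diagnosed cases

#keep all diagnosed cases plus an equally sized random subset of the non-diagnosed
balanced <- dataset[c(clin.idx, sample(nonclin.idx, length(clin.idx))), ]

forest.bal <- randomForest(as.factor(Diagnosis) ~ ., data = balanced, ntree = 1000)
print(forest.bal)  #OOB confusion matrix on the balanced subset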

I have ~150 features, and the other psychometrics of the tool/data set are very good.

I'm trying to use the classwt argument to correct for the imbalance with weights, but I can't get anything useful out of it. My current call is:

forest <- randomForest(as.factor(Diagnosis) ~ ., data=dataset, importance=TRUE, ntree=1000, keep.forest=TRUE)
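
For context, a hedged sketch of what the classwt attempt could look like on top of the call above; the weight values are placeholders only (the answer below suggests they may need to be far more skewed than inverse prevalence), and the order follows levels(as.factor(dataset$Diagnosis)):

#illustrative class weights only, in the order of the factor levels of Diagnosis
forest.wt <- randomForest(as.factor(Diagnosis) ~ ., data=dataset,
                          classwt=c(1, 10), importance=TRUE,
                          ntree=1000, keep.forest=TRUE)
print(forest.wt)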
Hammar

2 Answers


Here's some code to explain the usage, and there is a thread linking to more threads discussing how to handle unbalanced RF. In short, you can implement your prior expectation by changing the voting rule (cutoff), using stratified sampling (strata + sampsize), or using classwt. I usually use strata. In the code example below I was a little surprised that the classwt values had to be skewed that much.

library(randomForest)
library(AUC)

make.data = function(N=1000) {
  X = data.frame(replicate(6,rnorm(N))) #six features
  y = X[,1]^2+sin(X[,2]) + rnorm(N)*1 #some hidden data structure to learn
  rare.class.prevalence = 0.1
  y.class = factor(y<quantile(y,c(rare.class.prevalence))) #10% TRUE, 90% FALSE
  return(data.frame(X,y=y.class))
}

#make some data structure
train.data = make.data()

#1 - Balancing by voting rule, AUC of ROC will be unchanged...
rare.class.prevalence = 0.1
rf.cutoff = randomForest(y~.,data=train.data,cutoff=c(1-rare.class.prevalence,rare.class.prevalence))
print(rf.cutoff)

#2 - Balancing by sampling stratification
nRareSamples = 1000 * rare.class.prevalence
rf.strata = randomForest(y~.,data=train.data,strata=train.data$y,
                         sampsize=c(nRareSamples,nRareSamples))
print(rf.strata)

#3 - Balancing by class-weight during training.
rf.classwt = randomForest(y~.,data=train.data,classwt=c(0.0005,1000))
print(rf.classwt)

#view OOB-CV specificity and sensitivity
plot(roc(rf.cutoff$votes[,2],train.data$y),main="black default, red strata, green classwt")
plot(roc(rf.strata$votes[,2],train.data$y),col=2,add=T)
plot(roc(rf.classwt$votes[,2],train.data$y),col=3,add=T)


#make test.data and remove a random subset of the majority class so both classes are equally prevalent
test.data = make.data(N=50000)
test.data.balanced = test.data[-sample(which(test.data$y=="FALSE"))[1:40000],]

#print prediction performance %predicted correct:
sapply(c("rf.cutoff","rf.strata","rf.classwt"),function(a.model) {
  mean(test.data.balanced$y == predict(get(a.model), newdata=test.data.balanced))
})
  • I'm not sure this is consistent with optimum decision theory, and the use of any cutoff is arbitrary and information-losing. – Frank Harrell Aug 23 '15 at 15:00
  • What "this" is not concistent, stratification or classwt or RF? Yes cutoff is crude, but it is very practical to know RF not only perform aggregation by majority vote. Also cutoff can be modified after training. I extended the code example to show that for the code example cutoff finally do almost as well as stratification and class weight. – Soren Havelund Welling Aug 23 '15 at 16:48
  • "This" refers to classification, which always has an element of arbitrariness and failure to transport to other situations with different base probabilities. – Frank Harrell Aug 23 '15 at 17:07
  • I see :) Would you find it acceptable to treat the vote fractions from the ensemble as pseudo-probabilistic predictions? – Soren Havelund Welling Aug 23 '15 at 18:50
  • I'm not familiar enough with that approach. But I know that many researchers have used a method to convert RF to give probability estimates. I just don't know the details. – Frank Harrell Aug 23 '15 at 20:31
  • Thank you both for your input. It is much appreciated by someone who is not really from the mathematics side. Balancing the class weights is what I was trying to do, even though I see Prof. Harrell's point. The big/small numbers needed are what confused me. Is there a reasonably simple way of calculating them, rather than just using, for example, c(0.0005,1000) "because they work"? – Hammar Aug 23 '15 at 21:28
  • I was honestly surprised that inverse prevalence, the rule of thumb suggested in the original documentation (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm), was far from sufficient weighting. Personally I prefer stratification/downsampling because it is, to me, very transparent and performs equally well. If you expect 50%/50%, you stratify the bootstraps 50%/50%. The vote ratios of the predictions can be used to rank (and perhaps quantify) how likely each test sample is to belong to a class; see the sketch after these comments. – Soren Havelund Welling Aug 23 '15 at 22:26
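
To illustrate the last two points, here is a small follow-up sketch reusing the objects from the code example above (my illustration, not part of the original answer): the voting cutoff can also be supplied at prediction time, and the per-class vote fractions can be read out as probability-like scores.

#vote fractions per class on new data; usable as probability-like scores
p.hat = predict(rf.strata, newdata=test.data.balanced, type="prob")
head(p.hat)

#apply a different voting cutoff without retraining
pred.5050 = predict(rf.cutoff, newdata=test.data.balanced, cutoff=c(0.5,0.5))
table(pred.5050, test.data.balanced$y)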

You have hit upon one of the many problems caused by attempting "all or nothing" dichotomizations of inherently continuous quantities, in this case risk (probability). Optimum decision making uses estimated risks in conjunction with a cost/utility/loss function. Classification is an arbitrary manipulation where a real weakness is exposed as you try to move from one disease prevalence to a drastically different prevalence. If the right background variables are in the model (or are considered by RF and RF output is stated in terms of risk) no correction may be needed. Your situation is curiously the reverse of most. Typically a case-control study is done first and results need to be applied to a lower prevalence situation. But for either case, not having the background risk variables in the model that would allow for correction for prevalence means that you will need to be content to state the result as something like relative odds instead of absolute risk. That would be trivial with a logistic model.
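
As a minimal sketch of the logistic-model route mentioned above (assuming the same dataset and Diagnosis outcome as in the question, and noting that ~150 predictors on ~1580 cases would in practice call for variable selection or penalization first):

#two-level factor outcome; glm models the probability of the second level
fit <- glm(as.factor(Diagnosis) ~ ., data = dataset, family = binomial)

#coefficients are on the log-odds scale; exponentiating gives odds ratios
#(relative odds), which do not depend on the disease prevalence
exp(coef(fit))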

Frank Harrell