This question is about whether the mean of the binary response variable is a good starting point for a cut point in binary classification with logistic regression, rather than simply 0.5.
Traditionally, when people use logistic regression, they use 0.5 as the threshold to decide when the model predicts YES/positive versus NO/negative.
With an imbalanced training set, this can go wrong: the model may predict only one "answer".
One way of dealing with this is to balance the training set via oversampling or under-sampling, while keeping the test holdout set at the original balance.
However, I suspect that the mean of the binary response variable is a good starting point for a cut point. Is this usually true?
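If I understand correctly, there is a reason mean(y) is a natural reference point: for a logistic regression fit by maximum likelihood with an intercept, the score equation for the intercept forces the average fitted probability on the training data to equal mean(y). A minimal sketch with simulated data (the numbers here are made up for illustration):

```r
# For a logistic regression with an intercept, the ML score equation for the
# intercept implies sum(y - fitted) = 0, so mean(fitted(fit)) equals mean(y).
# Predicted probabilities are therefore centred on the event rate, not on 0.5.
set.seed(1)
x <- runif(5000)
y <- rbinom(5000, 1, plogis(-3 + 2 * x))   # imbalanced outcome
fit <- glm(y ~ x, family = "binomial")
mean(y)               # event rate
mean(fitted(fit))     # equal to mean(y) up to numerical tolerance
```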
I created two models, one on a balanced training set and another on the original imbalanced training set. With the balanced training set and the conventional 0.5 threshold:
print(table(actual=test$y, predicted=test$fit>0.5))
predicted
actual FALSE TRUE
0 2359 500
1 11 130
With the imbalanced training set, I used the mean of the binary response variable as the threshold:
print(table(actual=test$y, predicted=test$fit>0.0496))
predicted
actual FALSE TRUE
0 2317 542
1 7 134
If one just uses 0.5 with the imbalanced training set, the model looks like a complete failure: it predicts only the negative class.
print(table(actual=test$y, predicted=test$fit>0.5))
predicted
actual FALSE
0 2848
1 152
Both models had a KS statistic of 0.76 (the KS statistic is computed from the ROC curve and does not depend on the cut point, so the models rank the cases equally well), which makes the mean-of-response threshold seem like sound advice.
Example R code:
require(ROCR)
require(lattice)
#
x=1:10000/10000;
y=ifelse(runif(10000)-0.7>jitter(x),1,0)
#y=ifelse(rnorm(10000)-0.99>x,1,0)
mean(y)
s=sample(length(x),length(x)*0.7);
df=data.frame(x=x,y=y)
##undersample (use either this block or the oversample block below, not both)
train=df[s,]
train=rbind(train[train$y==1,],train[sample(which(train$y==0),sum(train$y==1)),])
##oversample (note: running this overwrites the undersampled train above)
train=df[s,]
train=rbind(train[train$y==0,],train[sample(which(train$y==1),sum(train$y==0),replace = T),])
mean(train$y) #now balanced
threshold=0.5
test=df[-s,] #unbalanced
mean(test$y)
#
ex=glm(y~x,train, family = "binomial")
summary(ex)
nrow(test)
test$fit=predict(ex,newdata = test,type="response")
message("threshold=",threshold)
print(table(actual=test$y, predicted=test$fit>threshold))
#+results
pred<-prediction(test$fit,test$y)
perf <- performance(pred,"tpr","fpr")
ks.sc=max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
plot(perf)
print(ks.sc); #ks.score
levelplot(fit~y+x,test,col.regions = terrain.colors(100)[1:95])
#+ imbalanced approach
train=df[s,]
threshold=mean(y) #mean of the binary response variable
message("threshold=",threshold)
ex=glm(y~x,train, family = "binomial")
summary(ex)
test$fit=predict(ex,test,type = "response")
summary(test$fit)
print(table(actual=test$y, predicted=test$fit>threshold))
print(table(actual=test$y, predicted=test$fit>0.5))
pred<-prediction(test$fit,test$y)
perf <- performance(pred,"tpr","fpr")
ks.sc=max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
plot(perf)
print(ks.sc); #ks.score
levelplot(fit~y+x,test,col.regions = terrain.colors(100)[1:95])
I noticed a similar question: How to choose the cutoff probability for a rare event Logistic Regression.
I also like the answer given here, which suggests choosing the cut point that maximizes specificity or sensitivity: Obtaining predicted values (Y=1 or 0) from a logistic regression model fit.
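That idea can be sketched in base R by maximizing Youden's J = sensitivity + specificity - 1, which is the same tpr - fpr quantity my KS computation above maximizes (simulated data; the numbers are made up for illustration, and this is not the code from that answer):

```r
# Rough sketch: pick the cut point that maximizes Youden's J = tpr - fpr.
set.seed(1)
x <- runif(5000)
y <- rbinom(5000, 1, plogis(-3 + 2 * x))          # imbalanced outcome (~13% events)
p <- fitted(glm(y ~ x, family = "binomial"))
cuts <- quantile(p, seq(0.01, 0.99, by = 0.01))   # candidate thresholds
j <- sapply(cuts, function(cut) {
  tpr <- mean(p[y == 1] > cut)                    # sensitivity
  fpr <- mean(p[y == 0] > cut)                    # 1 - specificity
  tpr - fpr
})
best <- cuts[[which.max(j)]]                      # data-driven cut point
best
```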
But I also suspect that the usual starting cut-off of 0.5 is bad advice.
Comments?