This question is about whether the mean of the binary response variable is a good starting point for a cut point in binary classification with logistic regression, rather than simply 0.5.
Traditionally, when people use logistic regression, they use 0.5 as the threshold to decide when the model predicts YES/positive versus NO/negative.
With an imbalanced training set, this can go wrong: the model may predict only one "answer".
One way of dealing with this is to balance the training set via oversampling or under-sampling, while keeping the test holdout set at the original balance.
However, I suspect that the mean of the binary response variable is a good starting point for a cut point. Is this usually true?
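If I understand correctly, there is a reason mean(y) is a natural reference point: for a logistic regression fit by maximum likelihood with an intercept, the score equation for the intercept forces the average fitted probability on the training data to equal mean(y). A minimal sketch with simulated data (the numbers here are made up for illustration):

```r
# For a logistic regression with an intercept, the ML score equation for the
# intercept implies sum(y - fitted) = 0, so mean(fitted(fit)) equals mean(y).
# Predicted probabilities are therefore centred on the event rate, not on 0.5.
set.seed(1)
x <- runif(5000)
y <- rbinom(5000, 1, plogis(-3 + 2 * x))   # imbalanced outcome
fit <- glm(y ~ x, family = "binomial")
mean(y)               # event rate
mean(fitted(fit))     # equal to mean(y) up to numerical tolerance
```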
I created two models, one on a balanced training set and another on the original imbalanced training set. With the balanced training set and the conventional 0.5 threshold:
print(table(actual=test$y, predicted=test$fit>0.5))
predicted
actual FALSE TRUE
0 2359 500
1 11 130
With the imbalanced training set, I used the mean of the binary response variable as the threshold:
print(table(actual=test$y, predicted=test$fit>0.0496))
predicted
actual FALSE TRUE
0 2317 542
1 7 134
If one just uses 0.5 with the imbalanced training set, the model looks like a complete failure: it predicts only the negative class.
print(table(actual=test$y, predicted=test$fit>0.5))
predicted
actual FALSE
0 2848
1 152
Both models had a KS statistic of 0.76 (the KS statistic is computed from the ROC curve and does not depend on the cut point, so the models rank the cases equally well), which makes the mean-of-response threshold seem like sound advice.
Example R code:
require(ROCR)
require(lattice)
#
x=1:10000/10000;
y=ifelse(runif(10000)-0.7>jitter(x),1,0)
#y=ifelse(rnorm(10000)-0.99>x,1,0)
mean(y)
s=sample(length(x),length(x)*0.7);
df=data.frame(x=x,y=y)
##undersample (use either this block or the oversample block below, not both)
train=df[s,]
train=rbind(train[train$y==1,],train[sample(which(train$y==0),sum(train$y==1)),])
##oversample (note: running this overwrites the undersampled train above)
train=df[s,]
train=rbind(train[train$y==0,],train[sample(which(train$y==1),sum(train$y==0),replace = T),])
mean(train$y) #now balanced
threshold=0.5
test=df[-s,] #unbalanced
mean(test$y)
#
ex=glm(y~x,train, family = "binomial")
summary(ex)
nrow(test)
test$fit=predict(ex,newdata = test,type="response")
message("threshold=",threshold)
print(table(actual=test$y, predicted=test$fit>threshold))
#+results
pred<-prediction(test$fit,test$y)
perf <- performance(pred,"tpr","fpr")
ks.sc=max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
plot(perf)
print(ks.sc); #ks.score
levelplot(fit~y+x,test,col.regions = terrain.colors(100)[1:95])
#+ imbalanced approach
train=df[s,]
threshold=mean(y) #mean of the binary response variable
message("threshold=",threshold)
ex=glm(y~x,train, family = "binomial")
summary(ex)
test$fit=predict(ex,test,type = "response")
summary(test$fit)
print(table(actual=test$y, predicted=test$fit>threshold))
print(table(actual=test$y, predicted=test$fit>0.5))
pred<-prediction(test$fit,test$y)
perf <- performance(pred,"tpr","fpr")
ks.sc=max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
plot(perf)
print(ks.sc); #ks.score
levelplot(fit~y+x,test,col.regions = terrain.colors(100)[1:95])
I noticed a similar question: How to choose the cutoff probability for a rare event Logistic Regression.
I also like the answer given here, which suggests choosing the cut point that maximizes specificity or sensitivity: Obtaining predicted values (Y=1 or 0) from a logistic regression model fit.
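That idea can be sketched in base R by maximizing Youden's J = sensitivity + specificity - 1, which is the same tpr - fpr quantity my KS computation above maximizes (simulated data; the numbers are made up for illustration, and this is not the code from that answer):

```r
# Rough sketch: pick the cut point that maximizes Youden's J = tpr - fpr.
set.seed(1)
x <- runif(5000)
y <- rbinom(5000, 1, plogis(-3 + 2 * x))          # imbalanced outcome (~13% events)
p <- fitted(glm(y ~ x, family = "binomial"))
cuts <- quantile(p, seq(0.01, 0.99, by = 0.01))   # candidate thresholds
j <- sapply(cuts, function(cut) {
  tpr <- mean(p[y == 1] > cut)                    # sensitivity
  fpr <- mean(p[y == 0] > cut)                    # 1 - specificity
  tpr - fpr
})
best <- cuts[[which.max(j)]]                      # data-driven cut point
best
```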
But I also suspect that the usual starting cut-off of 0.5 is bad advice.
Comments?