I built a logistic regression model. How can I choose the optimal threshold by looking at the ROC curve? I want to be able to decide whether an observation has the event.

acc <- read.csv("path to data")
View(acc)

# 75/25 train/test split
set.seed(1)
index  <- sample(1:nrow(acc), round(0.75*nrow(acc)))
train  <- acc[index,]
test   <- acc[-index,]
fitTrn <- glm(isOneday~., data=train, family=binomial(link="logit"))

# predicted probabilities on the test set, then the ROC curve and AUC
library(ROCR)
p   <- predict(fitTrn, newdata=test, type="response")
pr  <- prediction(p, test$isOneday)
prf <- performance(pr, measure="tpr", x.measure="fpr")
auc <- performance(pr, measure="auc")

mydata

gung - Reinstate Monica
D.Joe
  • 2
    Do you mean a threshold for significant coefficients in the logistic regression model? –  Jul 12 '17 at 12:47
  • Matt, I need the value of the threshold above which I can decide that the event will occur. –  Jul 12 '17 at 14:16
  • To clarify, you are looking for the optimal threshold for discriminating between outcomes using your logistic model. Is that correct? If so, it would help to get some more information. Specifically, you will almost certainly need information on the cost of following up on the model's output and the cost of making the "wrong" decision (in either direction) as well as the benefits of making the "right" decision. – Upper_Case Jul 12 '17 at 17:55
  • The contents of this thread are relevant. https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044#127044 – Sycorax Jul 12 '17 at 18:08
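
To make the point about costs concrete, here is a minimal sketch in plain Python of picking the cutoff that minimizes total misclassification cost on held-out data. The function name `min_cost_threshold` and the parameters `cost_fp` and `cost_fn` are hypothetical, and the scores and labels are made-up toy values, not the asker's data.

```python
def min_cost_threshold(preds, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (cutoff, total cost) minimizing cost_fp * FP + cost_fn * FN.

    An observation is predicted to have the event when its score >= cutoff.
    cost_fp and cost_fn are hypothetical, problem-specific inputs.
    """
    # Candidates: every observed score, plus infinity for "never predict the event".
    candidates = sorted(set(preds)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for p, y in zip(preds, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:  # strict '<' keeps the lowest cutoff on ties
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy data, for illustration only.
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(min_cost_threshold(scores, labels))               # equal costs
print(min_cost_threshold(scores, labels, cost_fp=5.0))  # a false alarm costs 5x a miss
```

With equal costs this reduces to maximizing the number of correct predictions; on the toy data, raising `cost_fp` moves the chosen cutoff from 0.35 up to 0.8.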

1 Answer

Here is a Python function I implemented that finds the threshold with the most correct predictions:

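import numpy as np

def find_best_threshold(preds, Y):
    # preds: 1-D array of predicted probabilities; Y: 1-D NumPy array of 0/1 labels
    order = np.argsort(preds)  # indices that sort the scores in ascending order
    thresholds, counts = np.unique(preds, return_counts=True)
    # position in the sorted order where each distinct score first appears
    start_idx = (np.cumsum(counts) - counts).astype(int)
    # errors at cutoff t = positives scored below t + negatives scored at or above t
    errors = (np.cumsum(Y[order]) - Y[order]) + np.cumsum(Y[order[::-1]] == 0)[::-1]
    correct = Y.shape[0] - errors[start_idx]
    return thresholds[np.argmax(correct)]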

I don't write R, but it should be easy to translate.
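
For anyone translating, the same idea can be sketched as a plain loop with no NumPy. This is only an illustrative rewrite under the assumption that an observation is predicted positive when its score is at or above the cutoff; the name `best_accuracy_threshold` and the toy data are made up. It is quadratic in the number of observations, so it is meant for reading rather than for large datasets.

```python
def best_accuracy_threshold(preds, labels):
    """Return (cutoff, number correct) for the cutoff with the most correct
    predictions, predicting positive when score >= cutoff."""
    best_t, best_correct = None, -1
    for t in sorted(set(preds)):  # only observed scores can change the predictions
        # A prediction is correct when "score >= t" agrees with "label is 1".
        correct = sum((p >= t) == (y == 1) for p, y in zip(preds, labels))
        if correct > best_correct:  # strict '>' keeps the lowest cutoff on ties
            best_t, best_correct = t, correct
    return best_t, best_correct


# Toy data, for illustration only.
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(best_accuracy_threshold(scores, labels))
```

On this toy data two cutoffs (0.35 and 0.8) each get 5 of 6 predictions right; the loop returns the lower one, as `np.argmax` does in the vectorized version.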

  • 1
    Number of correct predictions is not the right way to find an optimal ROC threshold. You get the most correct when you select everything, simple as that. Even accuracy, which is the rate of correct predictions, is quite bad. ROC curves are often used to assess performance in class-imbalanced problems, for which accuracy is a terrible measure of success. – Nuclear Hoagie Jul 12 '17 at 20:14
  • 1
    You only rarely get the most correct predictions by selecting a threshold of 1 or 0. Such a threshold gives you an accuracy equal to the share of the dominant class, and on most problems you can perform better than that. I agree that other thresholds might be better if you value type 1 errors more than type 2 errors or the other way around. In the question there is no information about that and this function yields the threshold that gives the lowest prediction error, which seems like the only sensible default threshold. – Peter Mølgaard Pallesen Jul 13 '17 at 07:19
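
For comparison, one common criterion that is read directly off the ROC curve is Youden's J statistic, J = TPR - FPR, the vertical distance of a ROC point above the diagonal. It is not what the answer above proposes; the sketch below is only an illustration on made-up, class-imbalanced toy data, and the name `youden_threshold` is hypothetical. It assumes both classes are present in the labels.

```python
def youden_threshold(preds, labels):
    """Return (cutoff, J) maximizing J = TPR - FPR over the observed scores,
    predicting positive when score >= cutoff."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos  # assumes n_pos > 0 and n_neg > 0
    best_t, best_j = None, -1.0
    for t in sorted(set(preds)):
        tpr = sum(1 for p, y in zip(preds, labels) if p >= t and y == 1) / n_pos
        fpr = sum(1 for p, y in zip(preds, labels) if p >= t and y == 0) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


# Toy imbalanced data (2 events out of 10), for illustration only.
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.45, 0.5, 0.7]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(youden_threshold(scores, labels))
```

On this toy data the accuracy-maximizing cutoff is 0.7 (9 of 10 correct, but only one of the two events caught), while Youden's J picks 0.4, which catches both events at the price of two false alarms. Which trade-off is preferable depends on the costs discussed in the comments.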