I have fitted a logistic regression model to my data set. Below is the code:

classifier <- glm(formula = Attrition ~ ., family = binomial, data = hr_train)
prob_pred <- predict(classifier, type = "response", newdata = hr_test[-2])
y_pred <- ifelse(prob_pred > 0.5, 1, 0)
cm <- table(hr_test[, 2], y_pred)

In my LR model I picked the cutoff value of 0.5 arbitrarily. I want to know how to find the cutoff value that maximizes sensitivity and specificity.

I have seen some posts about this, but as I am new to R programming I am not able to apply them to my model. Could anyone please help me?

  • Could you please elaborate on where you're stuck and what you've tried? – MichaelChirico Dec 22 '18 at 22:09
  • @MichaelChirico Thanks for the reply. I want to find the best cutoff point for my LR model. I have taken it as 0.5 arbitrarily, but I want to work out which cutoff would be most accurate. –  Dec 22 '18 at 22:11
  • Questions about _statistically_ what's appropriate are a bit off-topic here, but you mentioned maximizing sensitivity/specificity. Have you done anything to this end? Have you calculated sensitivity/specificity for the 0.5 cutoff? – MichaelChirico Dec 22 '18 at 22:32
  • @MichaelChirico Yes, I have calculated them for 0.5. Could you please help me find the cutoff at which I can maximize my sensitivity and specificity, i.e. the best cutoff for my model? –  Dec 22 '18 at 22:38
  • Now that you've calculated these for 0.5, can you calculate the same scores for a cutoff of 0.4? 0.6? Then try `which.max` and `max`. – MichaelChirico Dec 22 '18 at 22:49
  • https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification – kjetil b halvorsen Jan 01 '19 at 12:28
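
The approach suggested in the comments (compute sensitivity and specificity over a grid of cutoffs, then use `which.max`) could be sketched roughly as below. This is only an illustration under assumptions: the data are simulated stand-ins for `prob_pred` and `hr_test[, 2]`, the helper `sens_spec` is a hypothetical name, and "maximize sensitivity and specificity" is interpreted here as maximizing their sum (Youden's J), which is one common criterion among several.

```r
# Simulated stand-ins for the question's objects (assumption: with the real
# data you would use prob_pred and the 0/1 values of hr_test[, 2] instead).
set.seed(42)
n <- 500
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(-1 + 1.5 * x))   # 0/1 outcome
fit <- glm(actual ~ x, family = binomial)
prob_pred <- predict(fit, type = "response")

# Hypothetical helper: sensitivity and specificity at a single cutoff
sens_spec <- function(cutoff, prob, actual) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

# Evaluate a grid of cutoffs and collect the results in a data frame
cutoffs <- seq(0.05, 0.95, by = 0.01)
scores <- as.data.frame(
  t(sapply(cutoffs, sens_spec, prob = prob_pred, actual = actual))
)

# Youden's J = sensitivity + specificity - 1; pick the cutoff that maximizes it
scores$youden <- scores$sensitivity + scores$specificity - 1
best <- scores[which.max(scores$youden), ]
print(best)
```

If a cutoff is chosen this way, it is usually safer to choose it on training or validation data and keep the test set only for the final evaluation; otherwise the reported sensitivity and specificity will tend to be optimistic.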

2 Answers

The logistic regression model is a probability model. It is inappropriate to think of cutoffs when using it. The use of a cutoff for a decision threshold is separate from the modeling process and makes a strong assumption that the cost/loss/utility function (consequences of decisions) is the same for all observations/subjects. In general, defer thresholding or dichotomization to decision makers. Details are here.

A good use of probability estimates from the model is creation of a lift curve whereby observations are ranked by predicted probability and you select the "biggest bang for the buck" based on the budget or time allowed.
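
As a rough illustration of the ranking idea, here is a minimal sketch of a cumulative gains / lift table in base R. It uses simulated data rather than the question's `hr_test`, and the names `gains`, `budget`, and `k` are made up for the example; the 20% budget is an arbitrary assumption.

```r
# Simulated data (assumption: replace with the real outcome and the
# predicted probabilities from your own model).
set.seed(1)
n <- 1000
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(-2 + 1.2 * x))   # 0/1 outcome
fit <- glm(actual ~ x, family = binomial)
prob_pred <- predict(fit, type = "response")

# Rank observations from highest to lowest predicted probability
ord <- order(prob_pred, decreasing = TRUE)

# Cumulative gains: share of all positives captured in the top-ranked fraction
gains <- data.frame(
  frac_targeted = seq_len(n) / n,
  frac_captured = cumsum(actual[ord]) / sum(actual)
)

# Lift: how much better than targeting the same fraction at random
gains$lift <- gains$frac_captured / gains$frac_targeted

# Hypothetical budget: only the top 20% of observations can be acted on
budget <- 0.20
k <- round(budget * n)
print(gains[k, ])
```

Plotting `frac_captured` against `frac_targeted` gives the curve itself. No cutoff on the probability is needed: the budget determines how far down the ranking you go.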

Frank Harrell

The logistic regression cutoff (threshold) has nothing to do with R (or any other programming language). The threshold is a probability value that you choose as appropriate for the model you are building.

For example, consider a model that classifies an email as spam or not spam using logistic regression. You train the model on one set of email data and test it on another set. The threshold is then set based on how well the model classifies the test emails as spam or not spam. To start with, try computing the RMSE for the model and then set the threshold value.