I just realized that the predicted probabilities of default from my logistic regression are small and confined to a narrow range. My first guess is that this is related to the weak features selected during modeling. My question: is this normal, and has anybody observed this phenomenon before? How do I correctly transform the output so that it represents the "true probability", spread over (0.0, 1.0)? Thanks.
-
Logistic regression is well known *not* to fit the tails of its distribution. If the probability of default is small and falls in the tails, then there are much better models that will deliver much more accurate results. Examples include Poisson regression and the papers by Gary King linked in this thread: http://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes/235817#235817 – Mike Hunter Sep 29 '16 at 14:50
-
@DJohnson: *Poisson regression* for classification? Could you elaborate? – Stephan Kolassa Sep 29 '16 at 14:52
-
Hi DJohnson, I have never tried Poisson regression. However, as I remember, Poisson regression is for count data, right? Maybe I should try it; it won't hurt. – tiger Sep 29 '16 at 18:18
-
@StephanKolassa Default is y/n. When the probability of default is small, a Poisson model can provide a better fit, as it is intended for use with rare-event data. – Mike Hunter Sep 30 '16 at 13:27
-
@tiger Sure. It's far from a magic bullet but it's worth a try. – Mike Hunter Sep 30 '16 at 14:26
2 Answers
If your features are not overly predictive, then your predicted probabilities won't be very high.
For instance, if you are classifying people as having a particular disease or not, and all you have is their body temperature, the predicted probability that someone suffers from this particular disease will be small: after all, a high temperature could be due to any number of other diseases.
Work on getting more predictive features; in the disease example, blood tests, a look down the patient's throat, etc. It doesn't make sense to transform the predicted probabilities from a logistic regression that does not cleanly separate the classes.
Alternatively, your final classification can use a different threshold than 0.5. Nothing is stopping you from classifying samples with a predicted probability of 0.3 as being in the target group. What probability threshold you use should be governed by the relative costs of Type I and Type II errors.
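To make this concrete, here is a minimal synthetic sketch (not from the original thread, and assuming scikit-learn is available): with a single weakly predictive feature and a rare positive class, the fitted probabilities stay well below 1.0, yet a threshold below 0.5 still lets you flag the higher-risk cases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one weakly predictive feature, rare positive class (~10%).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-(-2.5 + 0.5 * x[:, 0])))
y = rng.binomial(1, p_true)

model = LogisticRegression().fit(x, y)
proba = model.predict_proba(x)[:, 1]
print(proba.min(), proba.max())  # narrow range, far below 1.0

# Classify with a threshold below 0.5, chosen from the relative error costs.
threshold = 0.15  # hypothetical value for illustration
pred = (proba >= threshold).astype(int)
print(pred.mean())  # fraction flagged as positive
```

The threshold of 0.15 here is arbitrary; in practice you would pick it from the costs of false positives versus false negatives, or from a precision/recall target.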

-
Hi Stephan, thanks. This is what I am doing right now, using different cutoff points. The small probabilities won't affect my AUC since they are used as ranks. It is hard to get strongly predictive features in my case. – tiger Sep 29 '16 at 15:13
-
Hi Stephan, could you look at my other question, which I posted on Stack Overflow? Here is the link. Thanks. http://stackoverflow.com/questions/39734484/what-does-it-mean-when-the-hold-out-sample-never-seen-in-modeling-auc-is-great – tiger Sep 29 '16 at 15:17
Maybe it is the range of the explanatory variables:
If you plot temperature (following the other answer's example) against the probability of having the disease, the probability at the maximum measured temperature might only be 60%.
The logistic regression model doesn't know you are talking about temperature (it is just another continuous variable). So the model could be saying, "The probability of having the disease is 95% when the temperature is 95 °C (really high)." The model doesn't know that such a temperature would have killed a human long before it was reached.
So, in conclusion: an output range of (0.0, 0.6) is not an incorrect output; it reflects the reality (with its flaws).
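A short synthetic sketch of this point (my own illustration, assuming scikit-learn): when body temperatures are only observed between 36 °C and 41 °C, the fitted curve tops out around 0.6 on that range, but the model will happily extrapolate to near-certainty at a physically impossible input.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: observed body temperatures in a realistic 36-41 °C range,
# with the true probability of disease reaching roughly 0.6 at the top end.
rng = np.random.default_rng(1)
temp = rng.uniform(36.0, 41.0, size=(2000, 1))
p_disease = 1 / (1 + np.exp(-0.5 * (temp[:, 0] - 40.0)))
y = rng.binomial(1, p_disease)

model = LogisticRegression().fit(temp, y)

# Within the observed range, predictions stay modest...
print(model.predict_proba([[41.0]])[0, 1])
# ...but at an impossible 95 °C the model extrapolates to near-certainty,
# knowing nothing about human physiology.
print(model.predict_proba([[95.0]])[0, 1])
```

This is why a bounded output range like (0.0, 0.6) can be exactly what the data support: the inputs that would push the probability higher simply never occur.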
