Low probability levels when doing logistic regression

Question

I am building a logistic regression model for a churn problem. When I scored the out-of-sample data set, I found very low output probabilities. Conventionally, I would use .5 as the cut-off, but this scored population doesn't have many customers above .5 (say just 1%). For business reasons, we need to approach at least 5% of customers to have an impact.

I therefore reduced the cut-off probability used to judge the scored data set, so I am now defining a probability of .1 as the cut-off. The model is very good at that level, in that it perfectly distinguishes my target from non-target.

1. Is there any problem with this approach, given that at the .1 level the model has very good accuracy?
2. What, in general, is the cause of low probabilities in the scored population?

The key question is not so much the probability of the outcomes, but their sample sizes relative to the number of independent variables in the model. One rule of thumb is to have 10 cases in the smaller category for each independent variable. — Peter Flom, Mar 07 '12 at 12:01

score 3 · Answer 1 · edited Apr 13 '17 at 12:44

Re 1: If you predict well in the hold out sample then you're doing well (no time to worry about propriety ;-) But since you're asking...

One way to look at the threshold is that when you set it to 0.1 you are implicitly specifying a loss function. That is, you are separating the question of what to do (e.g. approach a customer) from what to infer (e.g. that the probability of a 1 is 0.15). Indeed, you might make this separation a bit more explicit in your question. For example, you talk about needing to approach 5% of some people for something to be worthwhile, and then about how well you can predict cases. Is the issue that to approach the 'right' 5% (presumably the true '1's) you might have to approach many more (true '0's) to no effect? Then the cost of approach is relevant and the threshold should be set to minimise loss. But you also say you can predict the held out cases well when the threshold is set at 0.1...
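To make the link between a threshold and a loss function concrete, here is a minimal Python sketch. It assumes a very simple loss that is not stated in the question: approaching a customer costs a fixed amount, and approaching a true churner returns a fixed benefit. Under that assumption you approach whenever p * benefit > cost, so the threshold is cost / benefit. The function names, the cost and benefit figures, and the synthetic scores are all hypothetical. The second function shows the other way of framing the decision, namely choosing the cut-off that flags a fixed share of customers (the 5% in the question).

```python
import numpy as np

def cost_threshold(cost_of_approach, benefit_if_churner):
    """Threshold implied by a simple loss: approach when p * benefit > cost."""
    return cost_of_approach / benefit_if_churner

def quantile_threshold(scores, fraction):
    """Cut-off that flags roughly the top `fraction` of customers by score."""
    return np.quantile(scores, 1.0 - fraction)

# Synthetic scores standing in for model output (assumption: low, skewed
# probabilities, as you would expect with a rare event).
rng = np.random.default_rng(0)
scores = rng.beta(1, 60, size=10_000)

# Hypothetical costs: an approach costs 1 unit, a retained churner is worth 10.
t_cost = cost_threshold(cost_of_approach=1.0, benefit_if_churner=10.0)

# Cut-off that approaches about 5% of the scored population.
t_top5 = quantile_threshold(scores, 0.05)
flagged = scores >= t_top5

print(f"loss-based threshold: {t_cost:.3f}")
print(f"top-5% threshold:     {t_top5:.3f}")
print(f"share flagged:        {flagged.mean():.3f}")
```

If the two thresholds disagree strongly, that suggests the 5% target and the assumed costs are not consistent with each other, which is the separation between "what to do" and "what to infer" described above.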

Re 2: The cause of low probabilities is an unbalanced category distribution. This may cause estimation problems, though don't automatically assume that it will. If it does, you can often correct them quite easily, by changing the training data set structure and correcting the parameters, or in other ways. There's some discussion here, a link to a good paper, and much more discussion elsewhere in the site - just search for 'unbalanced sample'.
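As an illustration of "changing the training data set structure and correcting parameters", one common approach is to train on a sample in which the rare class is over-represented and then shift the intercept back to the population rate (often called prior correction). Whether this is the exact method in the linked paper is an assumption on my part. The sketch below applies the shift directly to a predicted probability; the function names are hypothetical, and the 1.2% rate is the event rate mentioned in the comments.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-z))

def correct_probability(p_sample, sample_rate, population_rate):
    """Map a probability from a model fitted on a rebalanced sample back to
    the population scale by shifting the intercept.

    sample_rate:     share of events in the (rebalanced) training sample
    population_rate: share of events in the real population
    """
    offset = logit(sample_rate) - logit(population_rate)
    return inv_logit(logit(p_sample) - offset)

# A score of 0.5 from a model trained on a 50/50 sample corresponds to the
# base rate once mapped back to a population with a 1.2% event rate.
p = correct_probability(0.5, sample_rate=0.5, population_rate=0.012)
print(round(p, 4))
```

This also shows why scored probabilities are low when events are rare: an "average" customer sits near the base rate, so few customers will ever reach 0.5.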

Thanks Conjugate, I read the paper. If my understanding is right, low probability levels might also be because of the low event rate (in my case 1.2%). What do you say? — ayush, Mar 12 '12 at 08:54
@ayush What do I say? In general, what I say is my answer above. In particular, in paragraph 3 sentence 1 I say that "the cause of low probabilities is an unbalanced category distribution". For example, having the probability of a 1 (what you describe as the 'event rate') of 1.2%. That answers your second question. — conjugateprior, Mar 12 '12 at 10:32
