4

The logistic regression model is:

$$\log\bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$

The most interesting case (for me) is when $p=1$ or $p=0$. In those cases the left-hand side is undefined: $p=1$ makes the ratio $p/(1-p)$ a division by zero, and $p=0$ makes its logarithm undefined.

For example: in my model, $p$ is the probability that a customer comes back after the first purchase. We observed that 100% of the clients with income $> 5000$ euros come back after the first purchase ($p=1$), and 100% of the clients with income $< 1000$ euros do not ($p=0$).

When I treat income as a continuous explanatory variable, there is no problem (income is a significant variable). But when I segment income into the intervals $(0,1000)$, $(1000,3000)$, $(3000,5000)$, and $(>5000)$, all of the categories become non-significant. I think this is because of the $p=1$ category, which makes $1-p=0$, so the ratio $p/(1-p)$ is undefined. What should I do in this case?
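A minimal R sketch that reproduces the issue with made-up data (the sample size, income distribution, and seed are invented for illustration; they are not the asker's data):

```r
## Toy data: everyone above 5000 comes back, no one below 1000 does,
## and the middle range is mixed -- perfect separation in the extreme bins.
set.seed(1)
n      <- 200
income <- runif(n, 0, 8000)
back   <- ifelse(income > 5000, 1,
          ifelse(income < 1000, 0,
                 rbinom(n, 1, plogis((income - 3000) / 800))))

## Continuous predictor: fits without complaint, income is significant.
fit_cont <- glm(back ~ income, family = binomial)

## Binned predictor: glm() warns that fitted probabilities numerically
## 0 or 1 occurred, and the Wald z-tests for every category look
## "non-significant" because the standard errors blow up.
inc_cat <- cut(income, breaks = c(0, 1000, 3000, 5000, Inf))
fit_cat <- glm(back ~ inc_cat, family = binomial)
summary(fit_cat)
```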

Nick Cox
Metariat
  • Look into bayesglm in the arm package. – generic_user May 28 '15 at 15:31
  • "100% of the clients with the income > 5000 euros return after the first purchase" -- what is the sample size (both total and for this > 5k euros subsample)? – Adrian May 28 '15 at 15:38
  • 6
    Have a look at [How to deal with perfect separation in logistic regression?](http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression?rq=1) question. And why would you want to segment the continuous variable in the first place? – lanenok May 28 '15 at 15:41
  • Sorry I can't help a bit more, but I think you might make some progress by looking at it like a potentially biased coin. As the number of samples goes up, you can be more and more confident that the observed probability is equal to the real probability, but you can continue to estimate the real probability, which will never become 1 (although it approaches 1). http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair – juckele May 28 '15 at 15:14
  • 1
    See also [What is the benefit of breaking up a continuous predictor variable?](http://stats.stackexchange.com/q/68834/17230). – Scortchi - Reinstate Monica Jun 01 '15 at 13:04

3 Answers

7

Even though you should not trust estimates of 0 and 1 for the probabilities (and the penalization methods suggested by others are worth doing), there is nothing inherently wrong with them. By all means avoid categorizing continuous variables; that makes things worse in every way. Complete separation is only a problem if you use Wald tests and Wald confidence intervals. Use likelihood ratio tests and profile likelihood confidence intervals instead, and all is well.
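In R, a minimal sketch of those alternatives (this assumes a binomial `glm` fit named `fit_cat`, like the one in the sketch under the question):

```r
## Likelihood ratio tests instead of Wald z-tests:
drop1(fit_cat, test = "LRT")      # per-term likelihood ratio tests
anova(fit_cat, test = "Chisq")    # sequential likelihood ratio tests

## Profile likelihood intervals instead of Wald intervals (dispatches
## to MASS's confint method for glm objects; expect warnings and an
## unbounded or NA limit for the separated terms):
confint(fit_cat)
```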

Frank Harrell
4

I think the confusion here is that the logit transformation is not applied to your data; it is properly applied to the underlying parameter. (It may help to read my answer here: Difference between logit and probit models.)

Of course, you don't have access to the underlying parameter. If you have multiple observations per study unit, that is, if your $Y$ value is a binomial count rather than the outcome of a single Bernoulli trial, you could try to estimate it directly from your data. I gather that is what you wanted to try. That is not how logistic regression models are fit, however. Instead, the software searches over possible parameter estimates. Newton's method (often called Newton-Raphson) is the most common algorithm. Very roughly, a guess at the appropriate $\hat\beta_j$'s is made, and the predicted probabilities and the joint likelihood of the data conditional on them are calculated. Then the algorithm 'looks around' to see whether the fit would be improved by changing the initial guess. If so, the slope estimates are changed and the process is repeated. This continues until the improvement is less than some threshold.
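To make that search concrete, here is an illustrative-only R implementation of the iteratively reweighted least squares (IRLS) form of Newton-Raphson that `glm()` uses; the function name and convergence details are mine, not part of the original answer:

```r
irls_logit <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                  # initial guess
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)                # linear predictor
    p   <- plogis(eta)                     # predicted probabilities
    W   <- p * (1 - p)                     # Bernoulli variances as weights
    z   <- eta + (y - p) / W               # "working" response
    beta_old <- beta
    beta <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))
    if (max(abs(beta - beta_old)) < tol) break  # improvement below threshold
  }
  beta
}

## Agrees with coef(glm(back ~ income, family = binomial)) on the toy
## data above: irls_logit(cbind(1, income), back)
## Under perfect separation, W -> 0 for some observations and the
## iterates drift toward +/- infinity instead of settling down.
```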

The fact that your observed dataset contains no 'failures' at the extremes of some variable does not, by itself, cause any problems for this method. Perfect separation (which gives rise to the Hauck-Donner effect) does cause problems, however, because separation implies that the best slope estimate is $\pm\infty$. For more on that topic, it may help to read this excellent CV thread: How to deal with perfect separation in logistic regression?, or to browse some of our other threads on this topic.


Regarding your explicit question of what you should do here: the best answer is not to categorize the continuous variable into intervals at all. That procedure leads to a host of well-known problems (for some discussion, see here: How to choose between ANOVA and ANCOVA in a designed experiment?). If you don't have access to precise income levels (that is, you didn't bin the data yourself; you were given those intervals), you can try replacing the intervals with their midpoints and using that as a continuous variable. This introduces measurement error in your predictor (an errors-in-variables problem), but as long as the midpoints aren't too far from the true incomes relative to the range of incomes in your dataset, it shouldn't be too bad.
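A sketch of the midpoint idea (it reuses the toy `inc_cat` from the sketch under the question; note that the open-ended $> 5000$ bin has no natural midpoint, so the cap of 8000 below is an arbitrary choice you would have to justify):

```r
## Midpoints of the bins (0,1000], (1000,3000], (3000,5000], (5000, cap]:
breaks <- c(0, 1000, 3000, 5000, 8000)                 # 8000 = assumed cap
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2    # 500 2000 4000 6500
income_mid <- mids[as.integer(inc_cat)]                # map each bin to its midpoint
fit_mid <- glm(back ~ income_mid, family = binomial)   # treat as continuous
```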

gung - Reinstate Monica
-1

This is called complete separation, and it happens when some combination of categorical covariates produces a sample proportion of 0 or 1 in the outcome variable. Under ordinary maximum likelihood there is nothing to be done, because the coefficients and their standard errors cannot be estimated in the presence of separation: the MLEs for the affected coefficients/interactions are $\pm\infty$. I would suggest choosing categories that do not have this problem.
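For completeness, the penalized-likelihood route mentioned in the comments and in Frank Harrell's answer does give finite estimates under complete separation. A minimal sketch using Firth's bias-reduced logistic regression from the `logistf` package (reusing the toy `back` and `inc_cat` from the sketch under the question):

```r
library(logistf)   # install.packages("logistf")

d <- data.frame(back = back, inc_cat = inc_cat)
fit_firth <- logistf(back ~ inc_cat, data = d)
summary(fit_firth)   # finite coefficients, penalized likelihood ratio p-values
```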