60

I'm trying to predict a binary outcome using 50 continuous explanatory variables (the range of most of the variables is $-\infty$ to $\infty$). My data set has almost 24,000 rows. When I run glm in R, I get:

Warning messages:  
1: glm.fit: algorithm did not converge  
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 

I've read other answers suggesting that perfect separation might be occurring, but I'm confident that isn't the case in my data (though quasi-complete separation could exist; how can I test whether that's the case?). Removing some variables sometimes makes the "did not converge" warning go away, but not always.

I tried using the same variables in a bayesglm function and got the same errors.

What steps would you take to figure out exactly what's going on here? How do you figure out which variables are causing the problems?
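For anyone wanting to reproduce the symptom, here is a minimal sketch (toy data, not the asker's) in which complete separation in a single predictor triggers the same warning:

```r
## Toy data (assumed for illustration): y is perfectly determined by x,
## so the MLE of the slope is infinite and glm's estimates diverge.
set.seed(1)
x <- rnorm(100)
y <- as.numeric(x > 0)

fit <- glm(y ~ x, family = binomial)
## Typically warns: glm.fit: fitted probabilities numerically 0 or 1 occurred

coef(fit)        # slope is huge (it would grow without bound with more iterations)
range(fitted(fit))  # fitted probabilities pinned at (numerically) 0 and 1
```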

smci
Dcook
  • 7
    Why are you confident that separation isn't occurring? In [the `bayesglm` paper](http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf), they argue that separation is "a common problem, even when the sample size is large and the number of predictors is small" – David J. Harris Dec 13 '12 at 05:26
  • 2
    Another thought: `bayesglm` tries to avoid separation by adding a prior, but with 24,000 rows, the prior is probably getting swamped by the likelihood. Try shrinking `prior.scale`, possibly by a large amount. Also consider increasing the prior's degrees of freedom, which will help rule out large values associated with separation. – David J. Harris Dec 13 '12 at 05:31
  • Thanks for the suggestions David. I don't think separation is occurring because when I sort each of the explanatory variables, the dependent variable isn't always true or false for high or low values of the explanatory variables. Unless this is considered separation: the dependent variable is true for all x7 > 32 but x7 is only > 32 in 10 cases. Is there a way to verify the separation outside of a logistic regression? Or see which variable is causing the separation? I tried your bayesglm suggestions (I set prior.scale to 1 and prior.df to Inf) and still got the Hauck Donner errors. – Dcook Dec 13 '12 at 07:27
  • 1
    [related](http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression?rq=1) [questions](http://stats.stackexchange.com/questions/5354/logistic-regression-model-does-not-converge?rq=1) – user603 Dec 13 '12 at 18:23
  • *"How do you figure out which variables are causing the problems?"* Binary-search is always a good fallback. You only have 50 variables, so if it's perfectly separated by one individual variable, 6 iterations will find the culprit. If it's two variables, at most 49+6=55 iterations will find it, worst-case. – smci Feb 02 '17 at 09:03
  • Also see my demo, [here](http://stats.stackexchange.com/questions/239928/is-there-any-intuitive-explanation-of-why-logistic-regression-will-not-work-for) – Haitao Du Feb 06 '17 at 20:16
  • I had a similar issue and fit the model with the brglm function in the [brglm](ftp://cran.r-project.org/pub/R/web/packages/brglm/brglm.pdf) package in R. Same syntax and everything and the model converged and gave similar answers to glm on unproblematic data. – mattador Jun 09 '17 at 04:58
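To the asker's follow-up ("is there a way to see which variable is causing the separation?"): a rough one-variable-at-a-time screen can be sketched in base R by refitting on each predictor alone and catching the tell-tale warning. Note that `find_suspects` is a hypothetical helper name, and this screen only detects separation visible in a single predictor; joint separation across several variables (the point of the answer below) will slip through it.

```r
## Hypothetical helper (illustration only): flag predictors that, fitted
## alone, trigger glm's "fitted probabilities numerically 0 or 1" warning.
## Catches per-variable separation only, not joint separation.
find_suspects <- function(y, X) {
  hit <- vapply(names(X), function(v) {
    flagged <- FALSE
    withCallingHandlers(
      glm(y ~ ., data = X[v], family = binomial),
      warning = function(w) {
        if (grepl("numerically 0 or 1", conditionMessage(w)))
          flagged <<- TRUE
        invokeRestart("muffleWarning")
      })
    flagged
  }, logical(1))
  names(X)[hit]
}

## Toy check: x1 separates y perfectly, x2 is noise
set.seed(2)
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- as.numeric(X$x1 > 0)
find_suspects(y, X)   # "x1"
```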

1 Answer

57

With such a large design space ($\mathbb{R}^{50}$!) it is possible to get perfect separation without having separation in any of the variables taken individually. I would even second David J. Harris's comment in saying that this is likely.

You can easily test whether your classes are perfectly separated in your design space: this boils down to solving a linear programming problem. An R implementation of this 'test' (not a test in the statistical sense of the term) is available in the safeBinaryRegression package.
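A sketch of that check (assuming the safeBinaryRegression package is installed; its `glm()` masks `stats::glm` and runs the linear-programming separation check before fitting, stopping with an error that names the separating terms rather than silently returning diverging estimates):

```r
## Sketch, assuming safeBinaryRegression is installed. Its glm() masks
## stats::glm and tests for separation via linear programming first.
library(safeBinaryRegression)

set.seed(3)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- as.numeric(x1 > 0)        # separation through x1 only

## Expected to stop with an error identifying the separating term(s);
## tryCatch captures the message instead of aborting the session.
tryCatch(
  glm(y ~ x1 + x2, family = binomial),
  error = function(e) conditionMessage(e)
)
```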

If it turns out that separation is indeed the issue, and if you are only interested in a plain vanilla use of glm (e.g. glm is not called by a higher-level function but by you), then there is an R implementation of an algorithm that slightly modifies the classical one to make it 'robust' against separation. It is implemented in the hlr package.
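Since hlr has since been archived (see the comments below), the brglm package that mattador mentions in the question comments is one drop-in alternative; this sketch assumes brglm is installed. It applies Firth-type bias reduction, which keeps the coefficient estimates finite even under complete or quasi-complete separation:

```r
## Sketch, assuming the brglm package is installed. brglm() applies
## Firth-type bias reduction, so estimates stay finite under separation.
library(brglm)

set.seed(4)
x <- rnorm(100)
y <- as.numeric(x > 0)          # complete separation in x

fit <- brglm(y ~ x, family = binomial)
coef(fit)   # finite, moderate slope, unlike the diverging glm() fit
```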

user603
  • 5
    Very cool and useful answer! I'll have to look into those packages. (+1) – Peter Flom Dec 13 '12 at 11:40
  • 1
    FWIW here is a description of another robust algorithm: http://www.win-vector.com/blog/2012/10/rudie-cant-fail-if-majorized/ – Alex Dec 13 '12 at 17:59
  • 2
    @Alex: thanks for the link. If glm is not converging because of bad starts then I can see how this method will help with that. On the other hand, if the problem is caused by perfect separation it is not clear to me how the MM idea would address that. I was wondering whether you could comment on this (i can eventually post this as a separate question). – user603 Dec 13 '12 at 18:09
  • Excellent (+1)! I, too, will have to look into those packages. – jbowman Dec 13 '12 at 19:41
  • 1
    Thanks for the answer @user603! I used safeBinaryRegression and separation was indeed occurring with several of the variables. Then I tried using MEL in the hlr package to build a model robust to this separation. However, the coefficients are huge (as they would be when separation occurs in normal glm) and here are the df and deviance numbers: Degrees of Freedom: 19112 Total (i.e. Null); 19063 Residual Null Deviance: 24990 Residual Deviance: 626000 AIC: 626000 Do you think I did something wrong? – Dcook Dec 14 '12 at 03:59
  • @Dcook: We don't have a key piece of info: "and separation was indeed occurring with several of the variables" I would argue that you don't need to include any other variables (beside those 'causing' perfect separation) in your model. Is this what you did? Also, I would argue that once your model is reduced to this minimal set of regressors, the $t$-statistics/likelihood (and quantities derived from them) become meaningless. Maybe this warrants a separate question. – user603 Dec 14 '12 at 09:29
  • Try some cross-validation with the fitted model. It may be the case that the world is quasi-separated and the model is just capturing this reality. I doubt it, but it may worth to test it. – Manoel Galdino Dec 26 '12 at 15:15
  • 1
    As of May 2021, the hlr package no longer exists unfortunately, I don't know why or whether another one has replaced it. – Arnaud Mortier May 27 '21 at 16:32
  • @ArnaudMortier: I could not find one either. *but* the code was in pure R, so you should be able to easily use the archived version. – user603 May 27 '21 at 16:40
  • True, but there is still the question of why: was the model abandoned because it had flaws? I will post this as a separate question in a minute. +1 for this answer anyway. – Arnaud Mortier May 27 '21 at 16:42
  • The question is [here](https://stats.stackexchange.com/questions/526210/hidden-logistic-regression-state-of-the-art) for the record. – Arnaud Mortier May 27 '21 at 16:48