I have been running logistic regression in R, and I have been having an issue where, as I include more predictors, the z-values and their p-values approach 0 and 1, respectively. For example, with a few predictors:
> model1
b17 ~ i74 + i73 + i72 + i71
> step1<-glm(model1,data=newdat1,family="binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.9461 1.8953 -3.665 0.000247 ***
i74 0.6842 0.9543 0.717 0.473384
i73 1.7691 4.8008 0.368 0.712502
i72 0.5134 2.0142 0.255 0.798812
i71 -0.6753 4.9173 -0.137 0.890771
The results appear to be fairly reasonable; however, if I have more predictors:
> model1
b17 ~ i90 + i89 + i88 + i87 + i86 + i85 + i84 + i83 + i82 + i81 +
i80 + i79 + i78 + i77 + i76 + i74 + i73 + i72 + i71
> step1<-glm(model1,data=newdat1,family="binomial")
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.887e+02 3.503e+05 -0.001 0.999
i90 1.431e-01 1.009e+04 0.000 1.000
i89 8.062e+01 1.027e+05 0.001 0.999
i88 9.738e+01 7.398e+04 0.001 0.999
i87 -1.980e+01 9.469e+03 -0.002 0.998
i86 9.829e+00 1.098e+05 0.000 1.000
i85 5.917e+01 3.074e+04 0.002 0.998
i84 -2.373e+01 1.378e+05 0.000 1.000
i83 7.257e+00 2.173e+05 0.000 1.000
i82 -1.397e+01 1.894e+05 0.000 1.000
i81 6.503e+01 1.373e+05 0.000 1.000
i80 3.728e+01 4.904e+04 0.001 0.999
i79 1.010e+02 5.556e+04 0.002 0.999
i78 -2.628e+01 1.546e+05 0.000 1.000
i77 4.725e+01 3.027e+05 0.000 1.000
i76 -6.517e+01 1.509e+05 0.000 1.000
i74 1.267e+01 1.175e+05 0.000 1.000
i73 2.796e+02 5.280e+05 0.001 1.000
i72 -2.533e+02 4.412e+05 -0.001 1.000
i71 -1.240e+02 4.387e+05 0.000 1.000
I know it is hard to say exactly what is going on without seeing the data, but the predictors are all 5-point Likert scale items. Still, are there any thoughts on what is occurring here? I don't have much experience with logistic regression, so I apologize if the question seems naive, but is there a certain threshold at which logistic regression falls apart because a large number of predictors is being fit to what is ultimately a very small amount of variance? Is this potentially a multicollinearity issue? Finally, when I run OLS regression on the data I get results that make more sense (or at least appear to); is it okay to run OLS regression on a binary outcome, and what are the consequences? Thank you!
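In case it helps with diagnosis, here is a minimal sketch of the checks I was considering running myself (the car package and the rough VIF cutoff in the comments are my own assumptions, not anything taken from the fits above):

# Check for (quasi-)separation: cross-tabulate predictors against the outcome.
# A cell of zero counts for one outcome category at some Likert level would
# suggest that predictor perfectly predicts the response there.
for (v in c("i90", "i89", "i88")) {
    print(table(newdat1[[v]], newdat1$b17))
}

# Check multicollinearity via variance inflation factors (requires the car
# package; VIFs only depend on the predictors, so an lm() fit is enough).
library(car)
lm_fit <- lm(b17 ~ i90 + i89 + i88 + i87 + i86 + i85 + i84 + i83 + i82 +
                 i81 + i80 + i79 + i78 + i77 + i76 + i74 + i73 + i72 + i71,
             data = newdat1)
vif(lm_fit)  # values much larger than ~10 are a common rule-of-thumb red flag

Would checks along these lines distinguish between a separation problem and a multicollinearity problem, or is there a better way to tell them apart?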