
I have been trying to fit a logistic regression model in R (using the mnlogit package) with 12 predictor variables x1...x12 to predict a binary outcome y.

There are three variables (call them x1, x2, x3; the order doesn't matter) that consistently give NA values in the regression model whenever they co-occur with one another (regardless of which of the remaining 9 are included or excluded). I had assumed that this was because of (near) collinearity among these variables. To test this, for the data matrix AF, I ran

cor(AF, use = "pairwise.complete.obs")

The correlations are cor(x1,x2) = 0.60, cor(x1,x3) = 0.10, and cor(x2,x3) = 0.20; the latter two are smaller than the correlations observed between each of these variables and at least some of the remaining 9 predictors. For instance, cor(x1,x10) = 0.35, larger than the correlation between x1 and x3 or between x2 and x3. However, a model with x1 and x10 together (as long as x2 and x3 are excluded) returns an estimated regression coefficient rather than NA. The same applies to other pairings.
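
A pairwise correlation check cannot detect a linear dependence that involves three or more columns at once; a rank check on the predictor matrix can. A minimal sketch of such a check (assuming AF holds the predictors as columns):

## qr() cannot handle NAs, so use complete cases only
X <- as.matrix(AF[complete.cases(AF), ])
qr_X <- qr(X)                 # QR decomposition with column pivoting
qr_X$rank < ncol(X)           # TRUE => an exact linear dependence exists
## columns that are linear combinations of the others, if any:
colnames(X)[qr_X$pivot[-seq_len(qr_X$rank)]]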

Could something other than correlation among the predictor variables be responsible for the NAs in the regression model? In previous attempts to estimate this logistic regression model, I encountered a similar problem with "sparse" variables (i.e., variables where nearly all individual measurements are 0), but I have since removed these as well.
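
Since the three troublesome predictors are binary (see the comments below), another check that does not rely on correlations at all is to tabulate them jointly and look for combinations that never occur in the data; an empty cell can alias a coefficient even when all pairwise correlations are modest. A sketch, using the placeholder names x1, x2, x3:

## an empty cell => some combination of the three binary predictors
## never occurs, which can produce NA coefficients (aliasing/separation)
table(AF[, "x1"], AF[, "x2"], AF[, "x3"])
## proportion of nonzero entries per column, to flag "sparse" variables
colMeans(AF != 0, na.rm = TRUE)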

Addendum: I have included an example of the output. Int_0, Int_1, and Out_Spr correspond to the 3 incompatible variables x1, x2, x3; note that the first 2 give NAs. This example has 19 rather than 12 predictor variables, but the results are qualitatively the same:

Call:
mnlogit(formula = fm, data = All_Final_last_long, ncores = 8, 
reflevel = "0")

Frequencies of alternatives in input data:
  0       1 
0.97935 0.02065 

Number of observations in data = 18305
Number of alternatives = 2
Intercept turned: OFF
Number of parameters in model = 19
# individual specific variables = 19
# choice specific coeff variables = 0
# individual independent variables = 0

-------------------------------------------------------------
Maximum likelihood estimation using the Newton-Raphson method
-------------------------------------------------------------
Number of iterations: 17
Number of linesearch iterations: 17
At termination: 
Gradient norm = 0.00104
Diff between last 2 loglik values = 6.8e-07
Stopping reason: Succesive loglik difference < ftol (1e-06).
Total estimation time (sec): 0.3
Time for Hessian calculations (sec): 0.22 using 8 processors.

Coefficients : 
              Estimate   Std.Error t-value  Pr(>|t|)    
TTo_last:1     -9.3977e+00  3.6825e+00 -2.5520   0.01071 *  
TTo_last_sq:1   3.1922e+01  2.4143e+01  1.3222   0.18611    
TTo_last_cub:1 -4.2153e+01  4.3544e+01 -0.9681   0.33302    
Out_Sum:1      -8.1411e+00  3.5207e+00 -2.3124   0.02076 *  
Out_Win:1      -6.4081e+00  3.5181e+00 -1.8215   0.06853 .  
Out_Spr:1      -7.3456e+00  3.5226e+00 -2.0853   0.03704 *  
TM_0:1          1.9803e+00  4.7466e+00  0.4172   0.67652    
TM_1:1         -2.3025e-01  2.0428e+00 -0.1127   0.91026    
Int_2:1         7.2317e-01  1.7757e-01  4.0726 4.649e-05 ***
TM_2:1         -8.7704e-01  3.9441e+00 -0.2224   0.82403    
Int_3:1         5.7395e-01  2.3668e-01  2.4250   0.01531 *  
TM_3:1         -7.3635e+00  1.3324e+01 -0.5526   0.58052    
Int_4:1         4.5607e-01  3.9432e-01  1.1566   0.24743    
TM_4:1          4.3723e+00  6.4967e+04  0.0001   0.99995    
Int_5:1         4.7611e-01  8.2288e-01  0.5786   0.56287    
TM_5:1          8.0781e+00  1.8213e+05  0.0000   0.99996    
Int_6:1        -1.3777e+01  3.0117e+03 -0.0046   0.99635    
Int_7:1        -1.3967e-01  8.6804e+03  0.0000   0.99999    
TS_9_1:1        5.5834e+00  4.6865e+00  1.1914   0.23350    
Int_0:1                 NA          NA      NA        NA    
Int_1:1                 NA          NA      NA        NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1741.8, df = 19
AIC:  3521.5 
  • You say you are using mnlogit -- Is it possible for you to run your model using glm()? Does that also produce the same issue? – bzki Dec 30 '19 at 19:22
  • I haven't tried it with glm but I can do so. Are the numerical methods used to estimate regression coefficients different for glm vs. mnlogit for logistic regression? – Max Dec 30 '19 at 19:34
  • Have you tried asking on Stack Overflow? Could you produce the output of one such attempt that gives NAs? – Todd Burus Dec 30 '19 at 19:38
  • I'm not exactly sure, but it's just a hunch (unless it takes a very long time to fit using glm) -- if the glm() doesn't have the issue then we can pinpoint the problem as something with mnlogit()'s estimation procedure. Also, glm might provide more helpful error/warnings if it has the same issue. – bzki Dec 30 '19 at 19:38
  • In response to T. Burus, someone on stackoverflow suggested that cross validated would be the more appropriate forum. I'll include output in the edited version of the original question. – Max Dec 30 '19 at 19:45
  • The correlation alone tells you the "co-linearity" but what you really want to check is the multicollinearity. What happens if you calculate the inverse of your design matrix in the settings where it doesn't work? Do you get an error stating that the matrix is singular? – jjet Dec 30 '19 at 19:46
  • I agree with @jjet. You're probably getting a singularity. Is it possible that some of your variables may be linear combinations of others? – Todd Burus Dec 30 '19 at 20:04
  • Are these 3 troublesome predictors continuous or categorical/binary? More information about what they mean and how they might be related, based on your knowledge of the subject matter, could help. Also, it seems that you are trying to do a logistic regression without an intercept, although I am unfamiliar with `mnlogit` and might be misinterpreting the code. If so, please see [this page](https://stats.stackexchange.com/q/260209/28500). – EdM Dec 30 '19 at 20:08
  • https://stats.stackexchange.com/questions/16327 provides methods to test your regressors for collinearity and even to identify which variables are involved in the dependencies. Check these out. – whuber Dec 30 '19 at 20:16
  • The predictor variables in question (x1...x3) are binary. In the meantime, I checked the design matrix X and found that transpose(X)X is invertible, so multicollinearity doesn't seem to be the issue here. – Max Dec 30 '19 at 20:53
  • Two thoughts. First, this might represent [perfect separation](https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression), with `mnlogit` reporting the issue differently than `glm` would. Try with `glm` instead and see if you get a warning or an error message. Second, I'm still somewhat concerned by what I interpret as a logistic regression model that has omitted an intercept; not sure that would lead to this problem, but it can certainly lead to others. – EdM Jan 01 '20 at 18:06
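
Following up on the glm() and separation suggestions in the comments above, a minimal sketch of the cross-check (dat and y are placeholder names for a wide-format data frame holding the binary outcome and the predictors, not names from the question):

## refit with glm(): aliased terms are reported as NA coefficients, and
## perfect separation typically triggers the warning
## "fitted probabilities numerically 0 or 1 occurred"
fit <- glm(y ~ . - 1, family = binomial, data = dat)  # "- 1" mirrors the
summary(fit)                                          # intercept-free fit

## ask R directly which terms are linear combinations of the others
alias(lm(y ~ . - 1, data = dat))

## a formal separation test is available via the detectseparation package:
## glm(y ~ . - 1, family = binomial, data = dat, method = "detect_separation")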

0 Answers