
So I'm running a logistic regression on a binary (binomial) outcome in R and got the following model.

> summary(Approach_stems_peri_log_model)

Call:
glm(formula = IndRev_PeriprostheticFractureSte ~ cpt_prim_stems_peri_log_train[, 
    i], family = binomial, data = cpt_prim_stems_peri_log_train, 
    maxit = 100)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9714  -0.9714  -0.9088   1.3984   1.9728  

Coefficients:
                                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                                                -14.57     394.77  -0.037    0.971
cpt_prim_stems_peri_log_train[, i]Antero-lateral            12.77     394.78   0.032    0.974
cpt_prim_stems_peri_log_train[, i]Hardinge                  13.90     394.77   0.035    0.972
cpt_prim_stems_peri_log_train[, i]Hardinge/Anterolateral    13.87     394.78   0.035    0.972
cpt_prim_stems_peri_log_train[, i]Lateral (inc Hardinge)    13.62     394.77   0.035    0.972
cpt_prim_stems_peri_log_train[, i]Other                     12.86     394.78   0.033    0.974
cpt_prim_stems_peri_log_train[, i]Posterior                 14.06     394.77   0.036    0.972
cpt_prim_stems_peri_log_train[, i]Trochanteric Osteotomy    29.13     738.56   0.039    0.969

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1767.8  on 1364  degrees of freedom
Residual deviance: 1742.1  on 1357  degrees of freedom
  (32 observations deleted due to missingness)
AIC: 1758.1

Number of Fisher Scoring iterations: 13

and ran what is analogous to an F-test (a likelihood-ratio test on the drop in deviance) with the following command

> 1-pchisq(1767.8-1742.1, 1364-1357)
[1] 0.0005697522
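
For reference, the same likelihood-ratio test can be computed directly from the fitted object; a minimal sketch, assuming the model above is still in the workspace (lower.tail = FALSE avoids the numerically fragile 1 - pchisq subtraction):

> fit <- Approach_stems_peri_log_model
> pchisq(fit$null.deviance - fit$deviance,    # 1767.8 - 1742.1
+        df = fit$df.null - fit$df.residual,  # 1364 - 1357
+        lower.tail = FALSE)
> anova(fit, test = "Chisq")   # same test via the analysis-of-deviance table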

I'm extremely confused about why this is happening. Is it due to the observations deleted because of missingness, and should that decrease the difference in degrees of freedom? I don't think that would matter at all. I can provide graphs or anything else requested, but this is extremely strange to me. I understand that the individual t-tests can be non-significant while the F-test is significant, but the p-values in the model are far too high for me to believe the overall test would show the model is extremely predictive.

hubertsng

1 Answer

First, let us clarify that these are in fact z-tests, not t-tests, although this does not affect the problem.

What you have observed here is the phenomenon of separation. This has a tag of its own on this site with many relevant posts; the thread How to deal with perfect separation in logistic regression? is particularly helpful, and in my view the answer there by scortchi is the one to read first. The diagnostic signs of separation are very large coefficient estimates with even larger standard errors. It is caused by the outcome either always or never occurring for some combination of predictors. In your case, for instance, the coefficient 29.13 corresponds to an odds ratio of $e^{29.13} \approx 4.5 \times 10^{12}$, which is enormous.
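
To illustrate with a toy example (not your data): make one level of a factor contain only events, and glm's estimate for that level runs off toward infinity, stopping at a huge coefficient with an even larger standard error:

> set.seed(1)
> x <- factor(rep(c("a", "b"), each = 20))
> y <- c(rbinom(20, 1, 0.5), rep(1L, 20))  # outcome is always 1 for level "b"
> summary(glm(y ~ x, family = binomial))   # note the giant estimate and SE for xb
> # one common remedy is a bias-reduced (Firth) fit, e.g. via logistf::logistf(y ~ x)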

mdewey
  • I never got perfect separation, or at least R never threw an error at me, so I didn't think that should be an issue. But regardless, shouldn't the extremely large standard error make the test ultimately non-predictive, since the prediction interval would be extremely wide? Is there something I'm missing about the function pchisq that makes it not take the standard errors into consideration, or at least not penalize the p-value accordingly? – hubertsng Aug 14 '19 at 15:54
  • After removing all factor levels that had counts of less than 30, I got another model that has significant variables and a significant overall model. I hadn't accounted for the fact that the baseline level had very few observations (only 2, in fact), which would cause the problem; if I set another baseline, it reports significant z-tests (a sketch of this kind of check follows below). Thanks! – hubertsng Aug 14 '19 at 15:59
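
A minimal sketch of the kind of fix described in the last comment, with hypothetical names (dat for the training data, approach for the factor, outcome for the binary response; "Posterior" chosen as an example of a well-populated baseline):

> tab <- table(dat$approach)   # how many observations per level?
> dat2 <- droplevels(subset(dat, approach %in% names(tab)[tab >= 30]))  # drop sparse levels
> dat2$approach <- relevel(dat2$approach, ref = "Posterior")            # reset the baseline
> summary(glm(outcome ~ approach, family = binomial, data = dat2))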