R - multivariate glm, what to use as a p-value and odds ratio?

Question

This is for a case-control study. I need to get a p-value and an odds ratio with confidence intervals from my glm, but I'm unsure of the best approach. I have the glm set up as follows:

lroverall <- glm(diagnosis~variant+location, overall, family=binomial)

Diagnosis (case/control), variant (yes/no), and location (A,B,C) are all categorical variables taken from my 'overall' dataset.

summary(lroverall) gives the output:

Call:
glm(formula = diagnosis ~ variant + location, family = binomial, 
    data = overall)
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.42270  -0.73877   0.00005   0.00005   2.67713  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)        0.5603     0.1727   3.244 0.001178 ** 
variantyes        -1.2194     0.5367  -2.272 0.023095 *  
locationA         -1.2050     0.2045  -5.892 3.82e-09 ***
locationB         -4.1156     1.0288  -4.000 6.32e-05 ***
locationC         -0.9249     0.2524  -3.664 0.000248 ***

For the p-value, does it make sense to take the Pr(>|z|) for the variant (0.023)? Does this effectively measure the association between diagnosis and variant while accounting for (removing?) the effect of location? Or would I want a p-value for the overall model, or a different test?

Similarly, is it appropriate to take the odds ratio for the variant (2.95e-01) calculated as below?

exp(cbind("Odds ratio" = coef(lroverall), confint.default(lroverall, level = 0.95)))

                  Odds ratio         2.5 %        97.5 %
(Intercept)     1.751193e+00  1.248321e+00  2.456640e+00
variantyes      2.954030e-01  1.031654e-01  8.458547e-01
locationA       2.996777e-01  2.007040e-01  4.474587e-01
locationB       1.631541e-02  2.172174e-03  1.225467e-01
locationC       3.965552e-01  2.417924e-01  6.503760e-01

Plain coef(lroverall) will give you $\log \frac{O_{y|x=1}}{O_{y|x=0}}$. You need to use exp(coef(lroverall)) to get the actual odds ratio. — Digio, Feb 16 '19 at 19:20
Hanaaa, are you sure that 'location' has only 3 levels? If so, then why are all three of them in the model? There should be only two of them in the model, as is the case with variantyes (you don't see variantno anywhere). You should run `levels(overall$location)` to see what's happening there. — Digio, Feb 17 '19 at 19:35
I have exp() on the outside of cbind(), which applies to the coef(lroverall); did you mean I need an additional exp()? And sorry you're correct about the levels, there were several more locations in my actual problem. I quickly removed a few here for brevity, but I should have removed one more from the output or indicated that. — abana, Feb 19 '19 at 16:41
The exp() is fine, I was just asking about the levels because there's clearly more than 3. Are you OK with interpretation or do you still need an answer? — Digio, Feb 20 '19 at 12:49

score 2 · Answer 1 · answered Feb 16 '19 at 23:36

The issues of overall p-value calculations for glm() models are discussed on this page.

The p-values listed in the summary(glm()) reports are for differences of each individual coefficient from 0.
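To make that concrete, here is a small sketch that reproduces the z value and Pr(>|z|) for variantyes by hand, using only the rounded estimate and standard error copied from the output in the question. The variable names are mine, and because the inputs are rounded the result only matches the printed p-value approximately.

```r
# Values copied from the summary() output in the question (rounded there)
est <- -1.2194   # Estimate for variantyes
se  <-  0.5367   # Std. Error for variantyes

# Wald test of H0: coefficient = 0
z <- est / se                # z statistic
p <- 2 * pnorm(-abs(z))      # two-sided p-value from the standard normal

round(z, 3)
signif(p, 3)
```

This is the test behind each row of the coefficient table: one coefficient against 0, with the other terms kept in the model.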

Interpreting these coefficients properly, however, can be difficult, and putting them together into odds ratios provides even more ways to go wrong. The default in R, which you have implicitly chosen by not specifying an alternative, is to use treatment contrasts.

With treatment contrasts the intercept is the log-odds for a particular reference scenario, in your case for variant=no at whatever your reference location happens to be (something other than A, B or C). The odds (not an odds ratio) for that scenario are the value in the intercept row of your table, 1.751, and the confidence interval you calculated for it is fine.

Each individual regression coefficient, however, then represents the difference associated with the predictor in question from that reference log-odds. So the log-odds for the case of variant=yes at your reference location is the sum of its coefficient with the intercept: $0.5603-1.2194=-0.6591$, for odds of 0.517. The odds ratio for variant=yes versus variant=no at a fixed location is the exponentiated coefficient itself, $\exp(-1.2194)=0.295$, which is the value in your table. If you want the log-odds for variant=yes at location A, B, or C then you have to also add in that location's own coefficient.
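The arithmetic above can be checked directly. This is only a sketch using the rounded coefficients copied from the question's output; the variable names are made up for illustration.

```r
# Coefficients copied from the summary() output in the question
b0     <-  0.5603   # (Intercept): variant = no, reference location
b_var  <- -1.2194   # variantyes
b_locA <- -1.2050   # locationA

odds_ref      <- exp(b0)                   # odds: variant = no, reference location
odds_var_ref  <- exp(b0 + b_var)           # odds: variant = yes, reference location
odds_var_locA <- exp(b0 + b_var + b_locA)  # odds: variant = yes, location A

# Odds ratio for variant yes vs. no at a fixed location (additive model)
or_variant <- exp(b_var)

round(c(odds_ref = odds_ref, odds_var_ref = odds_var_ref,
        odds_var_locA = odds_var_locA, or_variant = or_variant), 3)
```

Dividing odds_var_ref by odds_ref gives back or_variant, which is why the single coefficient is enough for the odds ratio but not for the odds in a specific scenario.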

Calculating the confidence intervals for specific log-odds or odds ratios has to use the information from the covariance matrix of the coefficients. You can't just use the individual standard errors (which are the square roots of the diagonal of that matrix) as there are typically covariances among the coefficient values (off-diagonal elements). Use vcov(lroverall) to get that covariance matrix. Then you need to use the formula for the variance of a sum of correlated variables to get the confidence intervals for specific cases. The rms package in R has facilities to simplify such calculations, but some find there to be a pretty steep initial learning curve for that package.
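A minimal sketch of that calculation in base R follows. The data are simulated because the real 'overall' data frame is not available, so the location levels ("Ref", "A", "B"), the effect sizes, and the resulting numbers are all assumptions for illustration. Only the model formula comes from the question.

```r
set.seed(1)
n <- 400

# Hypothetical stand-in for the 'overall' data frame
overall <- data.frame(
  variant  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  location = factor(sample(c("Ref", "A", "B"), n, replace = TRUE),
                    levels = c("Ref", "A", "B"))
)
eta <- 0.5 - 1.2 * (overall$variant == "yes") -
  1.0 * (overall$location == "A") - 0.5 * (overall$location == "B")
overall$diagnosis <- rbinom(n, 1, plogis(eta))

lroverall <- glm(diagnosis ~ variant + location, data = overall, family = binomial)

b <- coef(lroverall)
V <- vcov(lroverall)   # covariance matrix of the coefficients

# Weights selecting intercept + variantyes (variant = yes at the reference location)
w <- c("(Intercept)" = 1, variantyes = 1, locationA = 0, locationB = 0)
w <- w[names(b)]       # make sure the order matches coef()

est <- sum(w * b)                       # log-odds for that scenario
se  <- sqrt(drop(t(w) %*% V %*% w))     # here: Var(a) + Var(b) + 2 * Cov(a, b)
ci  <- est + c(-1, 1) * qnorm(0.975) * se

# Back-transform to the odds scale
exp(c(odds = est, lower = ci[1], upper = ci[2]))
```

The same estimate and standard error should come out of predict(lroverall, newdata = ..., type = "link", se.fit = TRUE) for the corresponding row, which is usually the less error-prone route.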

Thank you for the informative answer. I’m still a bit uncertain about p-value here. I suppose the p-value for overall fit isn't what I was asked to find. For the p-values reported by the glm: I can understand that the p-values are reported for the difference of a coefficient from 0, but in terms of my real world application--Would it be correct then to say the glm p-value for variant=yes is reported for the association of the variant with disease, after removing/‘controlling for’ effect of location? Or would I need to do an additional test? — abana, Feb 19 '19 at 17:21
@hanaaa there is a question whether you have adequately "controlled for" the effect of location with your particular model. The _p_-value reported for `variant=yes` assumes that the influence of `variant` on log-odds is independent of location. If that assumption is correct then you are correct. With such large baseline differences among locations, however, I would worry a lot about that assumption. You might need to consider a model with variant/location interactions. — EdM, Feb 19 '19 at 18:59
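One way to probe the assumption raised in the comment above is to fit the interaction model and compare it with the additive one by a likelihood-ratio test. This is a sketch on simulated data: the data frame, the location levels, and the effect sizes are invented, and the object names additive, with_int, and lrt are mine.

```r
set.seed(2)
n <- 600

# Hypothetical stand-in for the 'overall' data frame
overall <- data.frame(
  variant  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  location = factor(sample(c("Ref", "A", "B"), n, replace = TRUE),
                    levels = c("Ref", "A", "B"))
)
# Simulated outcome with a variant effect that is the same at every location
overall$diagnosis <- rbinom(n, 1, plogis(0.5 - 1 * (overall$variant == "yes")))

# Additive model (as in the question) and model with variant-by-location interaction
additive <- glm(diagnosis ~ variant + location, data = overall, family = binomial)
with_int <- glm(diagnosis ~ variant * location, data = overall, family = binomial)

# Likelihood-ratio test: does letting the variant effect differ by location improve the fit?
lrt <- anova(additive, with_int, test = "Chisq")
print(lrt)
```

A small p-value in the second row would suggest the variant effect differs by location, in which case a single odds ratio for the variant is not an adequate summary.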
