
When I fit a GLM in R, my standard errors are ridiculously high. It can't be because the independent variables are related, since they are all distinct ratings for an individual (i.e., interaction variables are out of the picture). Any idea what is causing this?

Below is the contingency table and glm summary:

             swagtype
has.gc.swag  A  B  C  D
      FALSE  1 22 71 49
      TRUE   0  1  2  5

summary(glm(has.gc.swag~swagtype, family=binomial, data=tt.dataset))
...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -15.57    1455.40  -0.011    0.991
swagtypeB      12.48    1455.40   0.009    0.993
swagtypeC      12.00    1455.40   0.008    0.993
swagtypeD      13.28    1455.40   0.009    0.993   

Note: I use swagtype instead of the real name since the info I am dealing with is confidential.

– Froyo Lover
  • Can you give the model? Probably your sample size is too small, and the table shows a very uneven distribution, which doesn't help. – kjetil b halvorsen Aug 07 '15 at 16:25
  • Model? You mean dataset? I'm not sure how to attach the dataset, but I have edited the original question to include the glm code. Also, thanks for reaching out :) – Froyo Lover Aug 07 '15 at 16:44
  • Also, I understand that there is an uneven distribution (it is very hard to get GC Swag), so I intend to use a GLM contrast between the different coefficients... after I resolve this high standard error issue. – Froyo Lover Aug 07 '15 at 16:47
  • About the model: you said only GLM, but there are many GLMs; is this logistic regression? (I guess so, but tell us.) If so, the asymptotic standard errors can be very misleading; this is called the Hauck-Donner effect, see http://stats.stackexchange.com/questions/45803/logistic-regression-in-r-resulted-in-hauck-donner-phenomenon-now-what and also MASS (the book), 4th edition, page 197. If that is the case, try the profile function from MASS (the R package). – kjetil b halvorsen Aug 07 '15 at 16:57
  • Clearly, you need to sample people with more type A swag. – AdamO Aug 07 '15 at 18:39

3 Answers


Looking at your data, I would say that the optimizer in glm is groaning under the task of trying to fit this model where so many of the cells are so small. You are right to be suspicious of the numbers you are getting, which scream "convergence problem". Just because the optimizer doesn't think it has failed, don't assume it has actually found an intelligent answer.

This is basically a two-way contingency table, and using glm isn't going to work any better than a Pearson chi-squared would.

So you could try a Fisher's exact test, using fisher.test(), to get a p-value.
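
To see why the small cells bite (a quick sketch; mat is the same 2x4 count matrix that appears further down in this answer):

mat <- matrix(c(1, 0, 22, 1, 71, 2, 49, 5), nrow = 2)  # rows: FALSE/TRUE; columns: A-D
chisq.test(mat)  # warns "Chi-squared approximation may be incorrect" -- tiny expected counts in the TRUE row

fisher.test() has no such trouble, since it conditions on the margins rather than relying on a large-sample approximation.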

Later: I tried to replicate your analysis with your data but I didn't have the same problems that you got.

swag    <- factor(c("A","B","C","D"))  # the four swag types
hasSwag <- c(1,22,71,49)               # counts from the first row of the table
totals  <- c(1,23,73,54)               # column totals
# -1 drops the intercept, so each level gets its own coefficient (level-means coding)
summary(glm(cbind(hasSwag, totals) ~ -1 + swag, family=binomial))

That's the setup.

Coefficients:
      Estimate Std. Error z value Pr(>|z|)
swagA  0.00000    1.41421   0.000    1.000
swagB -0.04445    0.29822  -0.149    0.882
swagC -0.02778    0.16668  -0.167    0.868
swagD -0.09716    0.19730  -0.492    0.622

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  2.9282e-01  on 4  degrees of freedom
Residual deviance: -1.0880e-14  on 0  degrees of freedom
AIC: 24.169

Number of Fisher Scoring iterations: 2

So it seems to have worked.

With your data in a table, I can do Fisher's:

> mat
     [,1] [,2] [,3] [,4]
[1,]    1   22   71   49
[2,]    0    1    2    5

fisher.test(mat)

Fisher's Exact Test for Count Data

data:  mat
p-value = 0.2549
alternative hypothesis: two.sided

So this is the same conclusion as the logistic: you can't distinguish between the swag types on the basis of this data set.

– Placidia

The problem as you posed it comes from R's treatment-based coding of factor variables and a poor choice of reference level. I've come up against this in survival analysis when my first choice of reference level had only a few events.

In typical R summary output, all individual factor levels are compared against the reference factor level: in your situation against level A, which only has 1 case. So you necessarily have difficulty in distinguishing the other levels from level A. If you relevel to set D as your reference level, you would get easier-to-interpret summaries of your glm for the other levels of this factor.
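
A minimal sketch of the releveling, using base R's relevel() on the data frame from the question:

tt.dataset$swagtype <- relevel(factor(tt.dataset$swagtype), ref = "D")  # D becomes the reference level
summary(glm(has.gc.swag ~ swagtype, family = binomial, data = tt.dataset))
# B and C are now contrasted directly against D; the A contrast will still misbehave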

That said, I also see no advantages of the glm over the contingency-table approach recommended by @Placidia.

– EdM

Like @Placidia, I also don't get quite the same results as you (I suspect the data you present differ somehow from the data you are actually using), but I do get the same general phenomenon. The reason is that you and I used reference level coding, whereas she used level means coding.

tt.dataset = read.table(text="  A  B  C  D
   1 22 71 49
   0  1  2  5", header=TRUE)
tt.dataset = as.data.frame(t(as.matrix(tt.dataset)))  # transpose: one row per swag type
tt.dataset$swagtype = rownames(tt.dataset)            # swag type as its own column
rownames(tt.dataset) = NULL
colnames(tt.dataset)[1:2] = c("no", "yes")            # counts without / with GC swag
tt.dataset
#   no yes swagtype
# 1  1   0        A
# 2 22   1        B
# 3 71   2        C
# 4 49   5        D
summary(glm(cbind(yes,no)~swagtype, family=binomial, data=tt.dataset))
# ...
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -22.57   48196.14       0        1
# swagtypeB      19.48   48196.14       0        1
# swagtypeC      19.00   48196.14       0        1
# swagtypeD      20.28   48196.14       0        1
# ...
#     Null deviance: 2.6956e+00  on 3  degrees of freedom
# Residual deviance: 3.1675e-10  on 0  degrees of freedom
# AIC: 15.926
# 
# Number of Fisher Scoring iterations: 21

The actual problem is that you have a separation between the yeses and the nos (note that TRUE and FALSE are names that you are strongly recommended not to use) in your dataset. As @Kjetilbhalvorsen notes in the comments, this is also called the Hauck-Donner phenomenon. People often think of separation as being within a single variable, and you can't see the separation in the table. But separation can very much exist amongst several variables, which is what you have here. There are a couple of tip-offs in the output. The first is the huge standard errors. Another is the very large number of "Fisher Scoring iterations" (21, instead of a more typical 4-5).
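
You can also see the separation directly in the fitted values (a sketch reusing the model above, stored here under the assumed name fit):

fit <- glm(cbind(yes, no) ~ swagtype, family = binomial, data = tt.dataset)
round(fitted(fit), 6)  # the fitted probability for type A is driven to ~0, i.e. onto the boundary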

An important effect of the separation is to make the standard errors very large, which essentially makes the Wald tests worthless. But you can still get a valid test by doing a likelihood ratio test. (This is what you should be doing anyway, because you don't actually care how each of B, C, and D compares to A, but how swagtype relates to the response.) Since you only have the one (factor) variable, you can get a likelihood ratio test using the null and residual deviances and degrees of freedom:

1-pchisq(2.6956e+00-3.1675e-10, df=3)
# [1] 0.4409755
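
Equivalently, anova() builds the same deviance test from the fitted model (fit as in the sketch above):

anova(fit, test = "Chisq")  # likelihood-ratio / deviance test for swagtype; same p-value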

If you only compare B, C, and D, you no longer have the separation issue and your results look more normal:

summary(glm(cbind(yes,no)~swagtype, family=binomial, data=tt.dataset, 
            subset=swagtype!="A"))
# ...
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)   
# (Intercept)  -3.0910     1.0225  -3.023   0.0025 **
# swagtypeC    -0.4785     1.2488  -0.383   0.7016   
# swagtypeD     0.8087     1.1251   0.719   0.4723   
# ...
#     Null deviance: 2.5863e+00  on 2  degrees of freedom
# Residual deviance: 2.4425e-15  on 0  degrees of freedom
# AIC: 13.926
# 
# Number of Fisher Scoring iterations: 4
1-pchisq(2.5863e+00-2.4425e-15, df=2)
# [1] 0.274405
– gung - Reinstate Monica