The coefficients changed a lot between a model that uses all levels of a factor and a model fitted to a subset containing only one level of that factor.
I am trying to fit a logistic regression of disease on contact exposure. The data come from several sites, so I included site as a factor (model ml1). I also fitted the model to a single site, WB, by selecting that site with the subset argument (model ml2).
ml1 <- glm(disease ~ x + factor(site) + factor(anycontact) + factor(comecat), data = gianalysis_bd, family = binomial)
summary(ml1)
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.44400    0.25761 -13.369  < 2e-16 ***
x                    0.24559    0.08309   2.956 0.003121 **
factor(site)FB       0.03967    0.15177   0.261 0.793792
factor(site)GB      -0.54896    0.16538  -3.319 0.000902 ***
factor(site)HB       0.39635    0.14699   2.696 0.007010 **
factor(site)SB      -0.13887    0.14347  -0.968 0.333069
factor(site)WB      -0.06200    0.14647  -0.423 0.672067
factor(site)WP      -0.03706    0.15388  -0.241 0.809677
factor(anycontact)1  0.40856    0.06846   5.968 2.41e-09 ***
factor(comecat)2     0.02260    0.07184   0.315 0.753037
factor(comecat)3     0.11195    0.07574   1.478 0.139405
ml2 <- glm(disease ~ x + factor(anycontact) + factor(comecat), data = gianalysis_bd, subset = site == "WB", family = binomial)
summary(ml2)
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -3.4016     0.4347  -7.825 5.06e-15 ***
x                     0.1421     0.1454   0.977  0.32834
factor(anycontact)1   0.7380     0.2590   2.850  0.00438 **
factor(comecat)2     -0.4049     0.2042  -1.983  0.04738 *
factor(comecat)3      0.1136     0.2182   0.520  0.60273
However, the coefficient of factor(anycontact)1 changed substantially, from 0.40856 in ml1 to 0.7380 in ml2. I cannot tell why that happened; I expected it to be the same in both models. Can someone explain the difference between the two models and the reason for it? Thank you very much.
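To make the comparison concrete, here is a minimal sketch of two checks, not a definitive answer. The first part uses only the estimates and standard errors printed above to see whether the pooled estimate lies inside a Wald 95% interval around the WB-only estimate. The second part uses simulated data (not gianalysis_bd; the data frame `sim`, the three sites, and the effect sizes in `true_effect` are all made up for illustration) to show that a model with only a site main effect estimates one common contact effect across sites, while a site-by-contact interaction with WB as the reference level gives the WB-specific effect.

```r
# Part 1: compare the two reported estimates, using only the printed output.
est_ml1 <- 0.40856; se_ml1 <- 0.06846   # factor(anycontact)1 in ml1
est_ml2 <- 0.7380;  se_ml2 <- 0.2590    # factor(anycontact)1 in ml2

# Wald 95% interval for the WB-only estimate.
ci_ml2 <- est_ml2 + c(-1, 1) * qnorm(0.975) * se_ml2
pooled_inside_ci <- est_ml1 >= ci_ml2[1] && est_ml1 <= ci_ml2[2]
print(round(ci_ml2, 3))
print(pooled_inside_ci)

# Part 2: simulated data (hypothetical, NOT the real gianalysis_bd).
set.seed(1)
n <- 6000
sim <- data.frame(
  # WB is listed first so that it becomes the reference level.
  site = factor(sample(c("WB", "FB", "GB"), n, replace = TRUE),
                levels = c("WB", "FB", "GB")),
  anycontact = factor(rbinom(n, 1, 0.5))
)

# Assumed true log-odds ratios for contact, deliberately different by site.
true_effect <- c(WB = 0.8, FB = 0.3, GB = 0.2)
eta <- -2 + true_effect[as.character(sim$site)] * (sim$anycontact == "1")
sim$disease <- rbinom(n, 1, plogis(eta))

# Site as a main effect only: one contact coefficient shared by all sites.
fit_pooled <- glm(disease ~ site + anycontact, data = sim, family = binomial)

# WB only, analogous to the subset = site == "WB" call in ml2.
fit_subset <- glm(disease ~ anycontact, data = sim,
                  subset = site == "WB", family = binomial)

# Interaction: the contact effect is allowed to differ by site; the
# "anycontact1" coefficient is then the effect at the reference site (WB).
fit_inter <- glm(disease ~ site * anycontact, data = sim, family = binomial)

b_pooled <- unname(coef(fit_pooled)["anycontact1"])
b_subset <- unname(coef(fit_subset)["anycontact1"])
b_inter  <- unname(coef(fit_inter)["anycontact1"])
print(c(pooled = b_pooled, subset_WB = b_subset, interaction_WB = b_inter))
```

With the numbers from the two summaries, the interval for the ml2 estimate comes out to roughly 0.23 to 1.25, which contains 0.40856, so the two estimates are not clearly inconsistent once the larger standard error of the single-site fit is taken into account. In the simulated part, the interaction coefficient for WB matches the subset coefficient because no other covariates are in the model; with x and comecat included, as in ml1 and ml2, the two would only agree exactly if those terms were also interacted with site.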