Let's consider the following data frame:
d <- data.frame(Sex     = factor(rep(c("Male", "Female"), times = 2), levels = c("Male", "Female")),
                Race    = factor(rep(c("White", "Black"), each = 2), levels = c("White", "Black")),
                y       = c(1, 3, 5, 7),
                weights = c(0.01, 0.03, 0.02, 0.01))
> d
Sex Race y weights
1 Male White 1 0.01
2 Female White 3 0.03
3 Male Black 5 0.02
4 Female Black 7 0.01
Let's assume that the weights are inverse propensity scores. If we calculate the weighted means for the Sex categories, we get this:
Sex              y
mean(Male):      3.67
mean(Female):    4.00
# i.e., the difference is about 0.33
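For reference, these weighted means can be reproduced in base R with weighted.mean() (a minimal sketch on the d defined above; the object name wm is just for illustration):

# weighted mean of y within each Sex level, using the inverse
# propensity scores in the weights column
wm <- sapply(split(d, d$Sex), function(g) weighted.mean(g$y, g$weights))
wm
#     Male   Female
# 3.666667 4.000000
diff(wm)  # Female - Male: 0.3333333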
However, if we fit a glm on the same data with the weights plugged in, we obtain the following coefficient estimates:
Call:
glm(formula = y ~ Sex + Race, data = d, weights = weights)
Deviance Residuals:
1 2 3 4
0 0 0 0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1 0 Inf <2e-16 ***
SexFemale 2 0 Inf <2e-16 ***
RaceBlack 4 0 Inf <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0)
Null deviance: 0.22857 on 3 degrees of freedom
Residual deviance: 0.00000 on 1 degrees of freedom
AIC: -Inf
Number of Fisher Scoring iterations: 1
in which the estimated coefficient for SexFemale is 2. Note that if we exclude weights = weights, we still obtain the same coefficient estimates (the model fits all four points exactly, so the weights do not change the fit), although parts of the summary output, such as the null deviance, do change. Now I'm wondering: why do the two mean differences differ (about 0.33 from the weighted means vs. 2 from the glm)? What can I say about the mean difference in this situation? Should I base my evaluation on the glm estimates, or on something else?
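For what it's worth, if I drop Race and refit with Sex alone (a quick check on the same d; the object name fit_sex is just for illustration), the SexFemale coefficient reproduces the weighted-mean difference, so the discrepancy seems to come from including Race in the model:

# weighted fit with Sex only: the coefficients are the weighted group
# means, so SexFemale recovers the 0.33 difference from above
fit_sex <- glm(y ~ Sex, data = d, weights = weights)
coef(fit_sex)
# (Intercept)   SexFemale
#   3.6666667   0.3333333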