You write:

"From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the survey package. Here, we have to use svyglm(), which instead runs a generalized linear model (which may be the same thing? I am fuzzy here in terms of what is different)."
svyglm will give you a linear model if you use family = gaussian(), which seems to be the default judging from the survey vignette (in version 3.32-1); see the example there where they fit regmodel. It seems that the package just makes sure that you use the correct weights when it calls glm. Thus, if your outcome is continuous and you assume that it is i.i.d. normally distributed, then you should use family = gaussian(). The result is a weighted linear model.
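For illustration, here is a minimal sketch of how such a call might look (my own code, not taken from the vignette; it assumes the small org data frame with a weights column that I construct further down):

library(survey)

# declare a design with sampling weights only (no clusters or strata)
des <- svydesign(ids = ~1, weights = ~weights, data = org)

# family = gaussian() gives a linear model; the point estimates should match
# lm(y ~ x, data = org, weights = weights), while the standard errors are
# design-based rather than model-based
svy_fit <- svyglm(y ~ x, design = des, family = gaussian())
coef(svy_fit)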
This answers your question

"Why can we not run OLS in the survey package, while it seems that this is possible to do with weighted data in Stata?"

by stating that you indeed can do that with the survey package. As for your other question,

"What is the difference in interpretation between the deviance of a generalized linear model and an R-squared value?"

there is a straightforward formula to get the $R^2$ with family = gaussian(), as some people have mentioned in the comments. Adding weights does not change anything either, as I show in the session below.
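Concretely, the formula I have in mind (my notation; $D$ and $D_0$ are what glm reports as the deviance and the null deviance) is

$$R^2 = 1 - \frac{D}{D_0} = 1 - \frac{\sum_i w_i (y_i - \hat y_i)^2}{\sum_i w_i (y_i - \bar y_w)^2},$$

since with family = gaussian() the deviance is the (weighted) residual sum of squares and the null deviance is the (weighted) total sum of squares around the weighted mean $\bar y_w$. With all weights equal to one, this reduces to the usual $R^2$.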
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
3 -2 -0.3615302 3
4 -1 0.7091697 4
5 0 0.6485203 5
6 1 3.8495979 6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.1693 -0.4463 0.2017 0.9100 2.9667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7368 0.3514 4.942 0.00113 **
x 0.9016 0.1111 8.113 3.95e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8781
F-statistic: 65.83 on 1 and 8 DF, p-value: 3.946e-05
>
> # make a data set where each row is repeated according to its weight
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
2.1 -3 -0.5675720 2
3 -2 -0.3615302 3
3.1 -2 -0.3615302 3
3.2 -2 -0.3615302 3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
Min 1Q Median 3Q Max
-1.19789 -0.29506 -0.05435 0.33131 2.36610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73680 0.13653 12.72 <2e-16 ***
x 0.90163 0.04318 20.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8896
F-statistic: 436.1 on 1 and 53 DF, p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept) x
1.7368017 0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use family = gaussian().
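As a quick check of that statement (my addition, reusing the fit and org_redundant objects from the session above):

# for family = gaussian(), the deviance equals the residual sum of squares
all.equal(fit$deviance, sum((org_redundant$y - fitted(fit))^2))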
Caveats

I assume that you want a linear model, judging from your question. Further, I have never used the survey package, but I quickly scanned through it and made assumptions about what it does, which I state in my answer.