You write:

"From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the survey package. Here, we have to use svyglm(), which instead runs a generalized linear model (which may be the same thing? I am fuzzy here in terms of what is different)."
svyglm will give you a linear model if you use family = gaussian(), which seems to be the default judging from the survey vignette (in version 3.32-1); see the example there where they fit regmodel. It seems that the package just makes sure that you use the correct weights when it calls glm. Thus, if your outcome is continuous and you assume that it is i.i.d. normally distributed, then you should use family = gaussian(). The result is a weighted linear model.
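For illustration, here is a minimal sketch of how such a call might look (my own code, not taken from the vignette; it assumes the small org data frame with a weights column that I construct further down):

library(survey)

# declare a design with sampling weights only (no clusters or strata)
des <- svydesign(ids = ~1, weights = ~weights, data = org)

# family = gaussian() gives a linear model; the point estimates should match
# lm(y ~ x, data = org, weights = weights), while the standard errors are
# design-based rather than model-based
svy_fit <- svyglm(y ~ x, design = des, family = gaussian())
coef(svy_fit)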
This answers your question

"Why can we not run OLS in the survey package, while it seems that this is possible to do with weighted data in Stata?"

by stating that you indeed can do that with the survey package. As for your other question,

"What is the difference in interpretation between the deviance of a generalized linear model and an R-squared value?"

there is a straightforward formula to get the $R^2$ with family = gaussian(), as some people have mentioned in the comments. Adding weights does not change anything either, as I show in the session below.
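Concretely, the formula I have in mind (my notation; $D$ and $D_0$ are what glm reports as the deviance and the null deviance) is

$$R^2 = 1 - \frac{D}{D_0} = 1 - \frac{\sum_i w_i (y_i - \hat y_i)^2}{\sum_i w_i (y_i - \bar y_w)^2},$$

since with family = gaussian() the deviance is the (weighted) residual sum of squares and the null deviance is the (weighted) total sum of squares around the weighted mean $\bar y_w$. With all weights equal to one, this reduces to the usual $R^2$.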
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
3 -2 -0.3615302 3
4 -1 0.7091697 4
5 0 0.6485203 5
6 1 3.8495979 6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.1693 -0.4463 0.2017 0.9100 2.9667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7368 0.3514 4.942 0.00113 **
x 0.9016 0.1111 8.113 3.95e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8781
F-statistic: 65.83 on 1 and 8 DF, p-value: 3.946e-05
>
> # make a data set where each row is repeated according to its weight
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
2.1 -3 -0.5675720 2
3 -2 -0.3615302 3
3.1 -2 -0.3615302 3
3.2 -2 -0.3615302 3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
Min 1Q Median 3Q Max
-1.19789 -0.29506 -0.05435 0.33131 2.36610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73680 0.13653 12.72 <2e-16 ***
x 0.90163 0.04318 20.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8896
F-statistic: 436.1 on 1 and 53 DF, p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept) x
1.7368017 0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use family = gaussian().
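As a quick check of that statement (my addition, reusing the fit and org_redundant objects from the session above):

# for family = gaussian(), the deviance equals the residual sum of squares
all.equal(fit$deviance, sum((org_redundant$y - fitted(fit))^2))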
Caveats

I assume that you want a linear model, judging from your question. Further, I have never used the survey package, but I quickly scanned through it and made assumptions about what it does, which I state in my answer.