I need to calculate the $R^2$ for a specific regression model in R, since the specific function I'm using doesn't return it. I'm trying to do so based on this excellent answer, but am still unable to replicate the results I get using lm. Below, a minimal example based on a slice of my actual data.
Note: I understand my $R^2$ may be misleadingly high, since I'm running a model without an intercept. Regardless, I can't replicate R's output even when I include the intercept.
> # A slice of my data
> df
# A tibble: 20 x 3
y x w
<dbl> <chr> <dbl>
1 0.212 c 1.44
2 0.419 c 0.758
3 -0.579 c 1.38
4 -0.523 a 0.419
5 -0.833 a 0.610
6 -0.686 a 2.06
7 -0.173 a 1
8 0.0649 c 1.16
9 -1.04 b 0.753
10 -0.913 b 2.46
11 0.189 a 1
12 -0.382 c 1
13 -0.770 b 1
14 -0.611 b 0.392
15 -0.536 c 1
16 -0.199 a 1
17 -0.968 b 0.631
18 -1.89 b 1
19 -0.511 c 0.898
20 0.581 c 0.775
>
> # Regression without intercept
> reg <- lm(y ~ 0 + x, df, weights = w)
> summary(reg)
Call:
lm(formula = y ~ 0 + x, data = df, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.8385 -0.3569 0.1230 0.2743 0.6189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
xa -0.3816 0.1729 -2.206 0.0414 *
xb -1.0475 0.1710 -6.125 1.12e-05 ***
xc -0.1226 0.1472 -0.833 0.4164
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4269 on 17 degrees of freedom
Multiple R-squared: 0.7171, Adjusted R-squared: 0.6671
F-statistic: 14.36 on 3 and 17 DF, p-value: 6.481e-05
>
> # Calculating R^2
> rss <- sum((reg$residuals * df$w)^2)
> tss <- sum((df$y * df$w - weighted.mean(df$y, df$w))^2)
> 1-rss/tss
[1] 0.6502406
>
>
> # I can't replicate the results even when I include the intercept
> reg2 <- lm(y ~ x, df, weights = w)
> summary(reg2)
Call:
lm(formula = y ~ x, data = df, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.8385 -0.3569 0.1230 0.2743 0.6189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3816 0.1729 -2.206 0.0414 *
xb -0.6660 0.2432 -2.738 0.0140 *
xc 0.2590 0.2271 1.140 0.2699
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4269 on 17 degrees of freedom
Multiple R-squared: 0.5034, Adjusted R-squared: 0.445
F-statistic: 8.616 on 2 and 17 DF, p-value: 0.002607
>
> # Calculating R^2
> rss <- sum((reg2$residuals * df$w)^2)
> tss <- sum((df$y * df$w - weighted.mean(df$y, df$w))^2)
> 1-rss/tss
[1] 0.6502406
Edit
As pointed out by @whuber, I had en error in my rss formula. I then went and looked at the code for summary.lm. It calculates the $r^2$ using the mean sum of squares, rather than the total sum of squares. I can, thus, replicate R's results if I use the mss, rather than the tss. What I still don't understand is that, theoretically, the TSS should be equal to RSS + MSS, but I'm not getting that. Code below:
> # I can replicate it when I use the mean sum of squares
> rss <- sum(df$w * reg$residuals^2)
> mss <- sum(df$w * reg$fitted.values^2)
> mss/(rss + mss)
[1] 0.7170557
>
> # or, alternatively
> 1-rss/(rss+mss)
[1] 0.7170557
>
> # What I don't understand now is why I can't use the total sum of squares
> tss <- sum(df$w * (df$y - weighted.mean(df$y, df$w))^2)
> 1-rss/tss
[1] 0.5033989
>
> # The TSS should be equal to RSS + MSS
> rss+mss
[1] 10.9486
> tss
[1] 6.238093
>