Calculating R-Squared with logged data

Question

I have created an example in R to illustrate the problem:

> set.seed(10)
> Ydata<-rnorm(200,15,5)*rep(1:200)^3
> Xdata<-rep(1:200)

> lm.test<-lm(log(Ydata)~Xdata)
> summary(lm.test)$r.squared 

[1] 0.7665965

> Yfit<-fitted.values(lm.test)
> lm.test2<-lm(Yfit~log(Ydata))
> summary(lm.test2)$r.squared 

[1] 0.7665965

> ExpYfit<-exp(fitted.values(lm.test))
> lm.test3<-lm(ExpYfit~Ydata)
> summary(lm.test3)$r.squared

[1] 0.6088178

When calculating the R-squared of an exponential model, regressing the fitted values for log(Y) on the observed log(Y) gives the same R-squared as the original regression, as expected:

log(Y) = fitted values = a + bX

but when we want to estimate the level of Y, exponentials of both sides are taken:

Y= exp(a + bX) = exp(fitted values)

However, when the exponentiated fitted values are regressed on the level of Y, the R-squared is different (0.6088178 instead of 0.7665965), which makes it look as if it is calculated incorrectly.

Why is this? And does this mean my predictions of Y are wrong?

score 1 · Accepted Answer · answered Aug 10 '15 at 12:23

If you regress a variable $y$ on an independent variable $x$ and $\hat{y}$ are the fitted values, then, by definition, $R^2=\frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}$, so it is the 'explained sum of squares' divided by the 'total sum of squares' for the dependent variable ($\bar{y}$ is the average of $y$).
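As a rough check of this definition, the sketch below computes the explained and total sums of squares by hand and compares the ratio with what `summary()` reports. The data are simulated in the spirit of the question, but the lognormal noise is an assumption made here so that the response stays strictly positive; the variable names are illustrative only.

```r
# Simulated data in the spirit of the question; the lognormal noise is an
# assumption made here so that y is strictly positive and log(y) is defined.
set.seed(10)
x <- 1:200
y <- exp(rnorm(200, mean = log(15), sd = 0.3)) * x^3

z     <- log(y)          # dependent variable of the regression
fit   <- lm(z ~ x)       # ordinary least squares with an intercept
z_hat <- fitted(fit)

ess <- sum((z_hat - mean(z))^2)   # explained sum of squares
tss <- sum((z - mean(z))^2)       # total sum of squares
rss <- sum((z - z_hat)^2)         # residual sum of squares

r2_definition <- ess / tss        # the ratio from the definition
r2_residual   <- 1 - rss / tss    # equivalent form for OLS with an intercept
r2_summary    <- summary(fit)$r.squared

print(c(definition = r2_definition,
        residual   = r2_residual,
        summary    = r2_summary))
```

For a least-squares fit with an intercept, all three numbers should agree up to floating-point error, and every one of them is measured relative to the variation in `z`, the variable that was actually regressed.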

Obviously, if you perform some transformation on $y$, let's denote it $z=f(y)$, then when you regress $z$ on $x$ the $R^2_{new}$ is $R^2_{new}=\frac{\sum_i (\hat{z}_i-\bar{z})^2}{\sum_i (z_i-\bar{z})^2}=\frac{\sum_i (\widehat{f(y)}_i-\overline{f(y)})^2}{\sum_i (f(y)_i-\overline{f(y)})^2}$, which is, in general, different from the first value. So you cannot meaningfully compare $R^2$ values across transformations of the dependent variable; you can only compare $R^2$ of models with the same dependent variable.
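To see this on numbers, here is a minimal sketch that fits the log model and then evaluates the fit on both scales. It again uses simulated data with lognormal noise (an assumption, chosen so that `log(y)` is always defined), and the names `r2_log`, `r2_level` and `r2_cor` are made up for illustration.

```r
# Simulated data in the spirit of the question; lognormal noise is an
# assumption made here to keep y strictly positive.
set.seed(10)
x <- 1:200
y <- exp(rnorm(200, mean = log(15), sd = 0.3)) * x^3

fit_log <- lm(log(y) ~ x)
r2_log  <- summary(fit_log)$r.squared   # R^2 with log(y) as dependent variable

y_hat <- exp(fitted(fit_log))           # back-transformed fitted values

# R^2-style measure on the level scale: 1 - SSR/SST computed with y itself
r2_level <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# What regressing the back-transformed fits on y reports: the squared
# correlation between y_hat and y
r2_cor <- summary(lm(y_hat ~ y))$r.squared

print(c(log_scale = r2_log, level_scale = r2_level, squared_cor = r2_cor))
```

The three values need not coincide, because each one is measured against a different total sum of squares or, in the last case, is just a squared correlation. Note also that `1 - SSR/SST` on the level scale is not bounded below by zero, since `exp(fitted)` is not a least-squares fit to `y`.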

It will also be useful to see the difference between : log(Ydata)~Xdata and Ydata~Xdata with a "log" link function. Start here: http://stats.stackexchange.com/questions/43930/choosing-between-lm-and-glm-for-a-log-transformed-response-variable — AntoniosK, Aug 10 '15 at 12:39
Yes when I calculate the r-squared with 1-SSR/SST I get a totally different value. It now makes sense that I shouldn't expect them to be the same :) — Chris, Aug 10 '15 at 13:04
