
Imagine we have a linear regression model with dependent variable $y$. We find its $R^2_y$. Now we run another regression, this time on $\log(y)$, and similarly find its $R^2_{\log(y)}$. I've been told that I can't compare the two $R^2$ values to see which model is better suited. Why is that? The reason given to me was that we would be comparing the variability of different quantities (different dependent variables), but I'm not sure that is a sufficient reason.

Also, is there a way to formalise this?

Any help would be appreciated.

An old man in the sea.
  • I suspect this might have been discussed before on Cross Validated. Have you gone through similar threads thoroughly? Also, do you care about different dependent variables (such as GDP vs. oil price) or transformations of the same variable (GDP vs. GDP growth), or both? – Richard Hardy May 14 '17 at 08:19
  • @RichardHardy I found some, but I think they were tangent to my question. Like this one: https://stats.stackexchange.com/questions/235117/comparing-regression-quality-for-different-dependent-variables The answer just states yes, not really explaining why. – An old man in the sea. May 14 '17 at 08:50
  • @RichardHardy I'm interested for transformations of the dependent variable. – An old man in the sea. May 14 '17 at 08:51
  • $R^2$ comparisons make sense only between nested models. – L.V.Rao May 14 '17 at 09:15
  • @L.V.Rao Thanks for your comment. Why is it so? – An old man in the sea. May 14 '17 at 09:17

2 Answers


It's a good question, because "different quantities" doesn't seem to be much of an explanation.

There are two important reasons to be wary of using $R^2$ to compare these models: it is too crude (it doesn't really assess goodness of fit) and it is going to be inappropriate for at least one of the models. This reply addresses that second issue.


Theoretical Treatment

$R^2$ compares the variance of the model residuals to the variance of the responses. Variance is a mean square additive deviation from a fit. As such, we may understand $R^2$ as comparing two models of the response $y$.

The "base" model is

$$y_i = \mu + \delta_i\tag{1}$$

where $\mu$ is a parameter (the theoretical mean response) and the $\delta_i$ are independent random "errors," each with zero mean and a common variance of $\tau^2$.

The linear regression model introduces the vectors $x_i$ as explanatory variables:

$$y_i = \beta_0 + x_i \beta + \varepsilon_i.\tag{2}$$

The number $\beta_0$ and the vector $\beta$ are the parameters (the intercept and the "slopes"). The $\varepsilon_i$ again are independent random errors, each with zero mean and common variance $\sigma^2$.

$R^2$ estimates the reduction in variance, $\tau^2-\sigma^2$, as a fraction of the original variance $\tau^2$: that is, it estimates $(\tau^2-\sigma^2)/\tau^2$.
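To make this concrete, here is a minimal numerical sketch in Python. Everything below (the data, the coefficients, the noise level) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Base model (1): the mean alone; its residual variance estimates tau^2.
tau2_hat = np.var(y - y.mean())

# Regression model (2): ordinary least squares of y on x.
slope, intercept = np.polyfit(x, y, deg=1)
sigma2_hat = np.var(y - (intercept + slope * x))

# R^2 is the estimated reduction in variance relative to tau^2.
r_squared = (tau2_hat - sigma2_hat) / tau2_hat
print(r_squared)
```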

When you take logarithms and use least squares to fit the model, you implicitly are comparing a relationship of the form

$$\log(y_i) = \nu + \zeta_i\tag{1a}$$

to one of the form

$$\log(y_i) = \gamma_0 + x_i\gamma + \eta_i.\tag{2a}$$

These are just like models $(1)$ and $(2)$ but with log responses. They are not equivalent to the first two models, though. For instance, exponentiating both sides of $(2\text{a})$ would give

$$y_i = \exp(\log(y_i)) = \exp(\gamma_0 + x_i\gamma)\exp(\eta_i).$$

The error terms $\exp(\eta_i)$ now multiply the underlying relationship $y_i = \exp(\gamma_0 + x_i\gamma)$. Consequently the variances of the responses are

$$\operatorname{Var}(y_i) = \exp(\gamma_0 + x_i\gamma)^2\operatorname{Var}(e^{\eta_i}).$$

The variances depend on the $x_i$. That's not model $(2)$, which supposes the variances are all equal to a constant $\sigma^2$.
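A small sketch makes the effect visible numerically (the parameter values are arbitrary): simulate model $(2\text{a})$ with homoscedastic log-scale errors and observe that the variance of $y$ itself grows with $x$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma0, gamma, n = 0.0, 2.0, 100_000

# Model (2a): log(y) is linear in x with homoscedastic errors eta.
# After exponentiating, the errors multiply exp(gamma0 + x*gamma),
# so Var(y) depends on x.
for x in (1.0, 1.3, 1.6):
    eta = rng.normal(scale=0.3, size=n)
    y = np.exp(gamma0 + gamma * x + eta)
    print("x =", x, " Var(y) =", round(np.var(y), 1))
```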

Usually, only one of these sets of models can be a reasonable description of the data. Applying the second set $(1\text{a})$ and $(2\text{a})$ when the first set $(1)$ and $(2)$ is a good model, or the first set when the second is good, amounts to working with a nonlinear, heteroscedastic dataset, which a linear regression therefore ought to fit poorly. When either of these situations is the case, we might expect the better model to exhibit the larger $R^2$. But what if neither is the case? Can we still expect the larger $R^2$ to help us identify the better model?

Analysis

In some sense this isn't a good question, because if neither model is appropriate, we ought to find a third model. However, the issue before us concerns the utility of $R^2$ in helping us make this determination. Moreover, many people think first about the shape of the relationship between $x$ and $y$--is it linear, is it logarithmic, is it something else--without being concerned about the characteristics of the regression errors $\varepsilon_i$ or $\eta_i$. Let us therefore consider a situation where our model gets the relationship right but is wrong about its error structure, or vice versa.

One such situation (which occurs commonly) is a least-squares fit to an exponential relationship,

$$y_i = \exp\left(\alpha_0 + x_i\alpha\right) + \theta_i.\tag{3}$$

Now the logarithm of $y$ is a linear function of $x$, as in $(2\text{a})$, but the error terms $\theta_i$ are additive, as in $(2)$. In such cases $R^2$ might mislead us into choosing the model with the wrong relationship between $x$ and $y$.

Here is an illustration of model $(3)$. There are $300$ observations of $x_i$ (a scalar regressor, equally spaced between $1.0$ and $1.6$). The left panel shows the original $(x,y)$ data while the right panel shows the $(x,\log(y))$ transformed data. The dashed red lines plot the true underlying relationship, while the solid blue lines show the least-squares fits. The data and the true relationship are the same in both panels: only the models and their fits differ.

[Figure: two scatterplots of the same simulated data. Left: the original $(x, y)$ data with its linear least-squares fit; right: the $(x, \log(y))$ data with its fit. Dashed red lines show the true relationship; solid blue lines show the fits.]

The fit to the log responses at the right clearly is good: it nearly coincides with the true relationship and both are linear. The fit to the original responses at the left clearly is worse: it is linear while the true relationship is exponential. Unfortunately, it has a notably larger value of $R^2$: $0.70$ compared to $0.56$. That's why we should not trust $R^2$ to lead us to the better model. That's why we should not be satisfied with the fit even when $R^2$ is "high" (and in many applications, a value of $0.70$ would be considered high indeed).
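A simulation in the spirit of this illustration is sketched below. The intercept, slope, and noise scale are assumptions rather than the values behind the figure, so the printed $R^2$ values will not reproduce $0.70$ and $0.56$; the point is only the mechanics of the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = np.linspace(1.0, 1.6, n)

# Model (3): exponential relationship with *additive* errors.
# alpha0, alpha, and the noise scale are assumed values; the noise is
# kept small enough that y stays positive and log(y) is defined.
alpha0, alpha = 0.0, 4.0
y = np.exp(alpha0 + alpha * x) + rng.normal(scale=10.0, size=n)

def r2(u, v):
    """R^2 of the least-squares line of v on u."""
    slope, intercept = np.polyfit(u, v, deg=1)
    return 1.0 - np.var(v - (intercept + slope * u)) / np.var(v)

print("R^2, linear fit to y:     ", round(r2(x, y), 3))
print("R^2, linear fit to log(y):", round(r2(x, np.log(y)), 3))
```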


Incidentally, a better way to assess these models includes goodness-of-fit tests (which would indicate the superiority of the log model at the right) and diagnostic plots of the residuals (whose heteroscedasticity would highlight problems with both models). Such assessments would naturally lead one either to a weighted least-squares fit of $\log(y)$ or directly to model $(3)$ itself, which would have to be fit using maximum likelihood or nonlinear least squares methods.
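For instance, a minimal sketch of fitting model $(3)$ by nonlinear least squares with SciPy, reusing the simulated `x` and `y` from above (the starting guesses in `p0` are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def model3(x, a0, a):
    # The mean function of model (3): y = exp(a0 + x*a) + additive error.
    return np.exp(a0 + a * x)

# curve_fit minimizes the sum of squared residuals; a poor starting
# guess in p0 can prevent convergence.
params, cov = curve_fit(model3, x, y, p0=(0.0, 3.0))
print("estimated (alpha0, alpha):", params)
```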

whuber
  • The criticism of R^2 is not fair. As with every tool, its usage should be well understood. In your examples above R^2 is giving the correct message. R^2 is in a way choosing the better signal-to-noise ratio. Of course it is not obvious when you put two graphs with totally different scales side by side. In reality the signal on the left is very strong compared to the noise deviations. – Cagdas Ozgenc May 14 '17 at 16:32
  • @Cagdas You seem to offer an inherently contradictory message. Since the two plots are *unavoidably* on two different scales--one plots the original responses and the other plots their logarithms--then pleading that something is "not obvious" because of this unavoidable fact does not seem to support your case. Complaining that this answer is "unfair" really doesn't hold up in light of the explicit analysis of the models I have offered. – whuber May 14 '17 at 17:19
  • There is no contradiction in what I am saying. R^2 chooses the higher signal-to-noise ratio. That's what it is doing. Trying to turn it into something else and claiming that it is not working is outright wrong. All criticisms of R^2 also apply to other goodness-of-fit indicators when applied to different response variables, but for some reason R^2 is chosen to be the scapegoat. – Cagdas Ozgenc May 14 '17 at 17:33
  • I would truly be interested in knowing, @Cagdas, just what part of this analysis you view as "scapegoating" $R^2$. As far as I can tell it is a dispassionate and technically correct assessment of what $R^2$ is and is not capable of accomplishing. I don't see how it's relevant to refer to "signal to noise ratios" when in fact the example explicitly shows how the better model (in the sense I described, which accords with what most people mean by "goodness of fit") produces the worse $R^2$. – whuber May 14 '17 at 18:08
  • Thanks for your help whuber. Sorry for the late acceptance, I haven't had a lot of free time lately. ;) – An old man in the sea. May 17 '17 at 20:50

I'll give a very non-technical and intuitive answer. Imagine you have both the linear and the log model, and suppose the assumptions of linear regression hold for the linear model, with homoskedastic error terms. If those assumptions hold, your assumption about the true relationship between the regressors and the regressand is supported. When you take the log of the dependent variable and use it as the new dependent variable, however, the error terms inevitably change, and they may no longer be homoskedastic. That would suggest you have assumed the wrong true relationship. So even if the log version has a better R squared on your sample, in the long run your model may produce very erroneous predictions as you move to more extreme values, owing to the wrong assumption about the true relationship between the variables. Hence it is wiser to check the assumptions before looking at the R squared.

Hope that helps; this is my first answer in this community, so I apologise if it does not follow the standards.
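As a minimal sketch of this advice, assuming `statsmodels` is available and reusing one-dimensional arrays `x` and `y` (with $y>0$) such as those simulated in the other answer, one could test the residuals of both specifications for heteroskedasticity (here with a Breusch-Pagan test) before comparing $R^2$:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

X = sm.add_constant(x)  # design matrix with an intercept column
for name, response in (("y", y), ("log(y)", np.log(y))):
    fit = sm.OLS(response, X).fit()
    # het_breuschpagan returns (LM stat, LM p-value, F stat, F p-value);
    # a small p-value flags heteroskedastic residuals.
    _, lm_pvalue, _, _ = het_breuschpagan(fit.resid, X)
    print(name, " R^2 =", round(fit.rsquared, 3),
          " BP p-value =", round(lm_pvalue, 4))
```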