You might be better off fitting a generalized linear model (GLM) instead of a "plain" linear model, and analyzing the GLM's residuals. This procedure, and a few good reasons for using it, are laid out in this answer. GLMs have more than one kind of residual, but there is a large literature on analyzing them.
In case you balk at the idea of switching from OLS to ML, or you're hesitant to impose distributional assumptions on the response, consider that regression with OLS is equivalent to a GLM that assumes a normally distributed response and the identity link function.
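To make that equivalence concrete, here is a minimal sketch in Python's statsmodels (my choice of software and simulated data; the answer itself assumes no particular package), fitting the same data with OLS and with a Gaussian GLM using the identity link:

```python
# Sketch: a Gaussian GLM with an identity link reproduces the OLS fit.
# The data below are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link is the default

print(ols_fit.params)  # coefficients from OLS...
print(glm_fit.params)  # ...match the GLM coefficients up to numerical tolerance

# GLMs expose several kinds of residuals, e.g. deviance and Pearson residuals:
print(glm_fit.resid_deviance[:5])
print(glm_fit.resid_pearson[:5])
```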
Moreover, regression models (generalized or not) describe a conditional mean, but back-transforming their predictions does not in general produce a conditional mean for the un-transformed response. In your case, $\operatorname{E}(\sqrt{y}) \neq \sqrt{\operatorname{E}(y)}$.
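For a quick numerical illustration (my own, not part of the original argument): if $y$ takes the values $0$ and $4$ with equal probability, then
$$\operatorname{E}(\sqrt{y}) = \tfrac{1}{2}\sqrt{0} + \tfrac{1}{2}\sqrt{4} = 1, \qquad \text{while} \qquad \sqrt{\operatorname{E}(y)} = \sqrt{2} \approx 1.41.$$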
(edit/update) Consider a response $y$ and its transformation $y'=\sqrt{y}$. You fit the regression model
$$y'=\beta_0 + \beta x + \varepsilon$$
which, if $\operatorname{E}(\varepsilon|x)=0$ (as we assume for OLS), is equivalent to the model
$$\operatorname{E}(y'|x) = \operatorname{E}(\sqrt{y}|x) = \beta_0 + \beta x$$
The problem is that $\left(\operatorname{E}(\sqrt{y}|x)\right)^2 \neq \operatorname{E}(y|x)$ in general. Fortunately, in this particular case we can move forward without making any additional assumptions by appealing to the formula $\operatorname{V}(Z) = \operatorname{E}(Z^2) - \left(\operatorname{E}(Z)\right)^2 \implies \operatorname{E}(Z^2) = \operatorname{V}(Z) + \left(\operatorname{E}(Z)\right)^2$, so that
$$\operatorname{E}(y|x) = \operatorname{V}(\sqrt{y}|x) + \left(\operatorname{E}(\sqrt{y}|x)\right)^2$$
and therefore, since OLS assumes a constant conditional variance $\operatorname{V}(\sqrt{y}|x) = \sigma^2$, which we estimate with the residual variance $\widehat{\sigma^2}$ of the fitted model,
$$ \widehat{y} = \widehat{\sigma^2} + \left(\widehat{y'}\right)^2 $$
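Here is a minimal sketch of that back-transformation, again in statsmodels with simulated data (both are assumptions of mine, not part of the answer): fit OLS to $\sqrt{y}$, then add the estimated residual variance to the squared predictions.

```python
# Sketch: back-transform predictions from a regression on sqrt(y) as
# yhat = sigma2_hat + yhat_prime**2, per the identity above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y_sqrt = 2.0 + 0.3 * x + rng.normal(scale=0.5, size=500)  # linear model on the sqrt scale
y = y_sqrt ** 2                                           # observed response on the original scale

X = sm.add_constant(x)
fit = sm.OLS(np.sqrt(y), X).fit()

yhat_prime = fit.predict(X)   # estimates of E(sqrt(y) | x)
sigma2_hat = fit.scale        # estimated residual variance on the sqrt scale

yhat_naive = yhat_prime ** 2                  # omits the variance term
yhat_adjusted = sigma2_hat + yhat_prime ** 2  # E(y|x) = V(sqrt(y)|x) + E(sqrt(y)|x)^2

print(np.mean(y - yhat_naive))     # positive on average: the naive squaring underestimates E(y|x)
print(np.mean(y - yhat_adjusted))  # close to zero on average
```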
In general, however, you will need to make some additional assumptions. If you assume that $(y'|x) \sim \operatorname{Normal}(\beta_0 + \beta x, \sigma^2)$, which is the distributional assumption implicit in OLS on the transformed response, you can usually derive the distribution of the back-transformed response by a change of variables (applying the Jacobian to the Gaussian density) and then take its expectation. With a log-transformed response, for instance, the original-scale response follows a log-normal distribution, so the correct back-transformation is $\widehat{y} = e^{\widehat{y'} + \frac{\widehat{\sigma^2}}{2}}$. This particular (and very common) case is demonstrated nicely on David Giles' blog.
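And a matching sketch for the log-transformed case (same caveats: statsmodels and simulated data are my assumptions, not the answer's): because $\operatorname{E}(y|x) = e^{\mu + \sigma^2/2}$ when $\log y \,|\, x \sim \operatorname{Normal}(\mu, \sigma^2)$, we add half the residual variance before exponentiating.

```python
# Sketch: log-normal back-transformation yhat = exp(yhat_prime + sigma2_hat / 2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
log_y = 0.5 + 0.2 * x + rng.normal(scale=0.8, size=500)  # linear model on the log scale
y = np.exp(log_y)

X = sm.add_constant(x)
fit = sm.OLS(np.log(y), X).fit()

yhat_prime = fit.predict(X)   # estimates of E(log(y) | x)
sigma2_hat = fit.scale        # estimated residual variance on the log scale

yhat_naive = np.exp(yhat_prime)                     # underestimates E(y | x)
yhat_adjusted = np.exp(yhat_prime + sigma2_hat / 2)

print(np.mean(y - yhat_naive))     # positive on average
print(np.mean(y - yhat_adjusted))  # close to zero on average
```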