
Consider a linear regression $$ y=X\beta+\varepsilon. $$ Residuals $e:=y-X\hat\beta$ are often used as substitutes for the unobserved model errors $\varepsilon$ when validating assumptions such as homoskedasticity of $\varepsilon$, normality of $\varepsilon$, and others.

When the model errors $\varepsilon$ are homoskedastic with variance $\sigma^2_\varepsilon$, the residuals $e$ have unequal variances: $\text{Var}(e)=\sigma^2_\varepsilon(I-H)$ where $I$ is an identity matrix and $H:=X(X^\top X)^{-1}X^\top$ is the hat matrix. (For the same reason, the residuals are also correlated.)
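This is easy to check numerically. Below is a minimal NumPy sketch (using an arbitrary simulated design matrix, not anything from the question itself) showing that the diagonal of $I-H$, and hence the residual variances $\sigma^2_\varepsilon(1-h_{ii})$, differs across observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))  # hypothetical design matrix

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

# Under homoskedastic errors, Var(e) = sigma^2 * (I - H):
# the diagonal entries 1 - h_ii lie in (0, 1) and are unequal,
# so the raw residuals have unequal variances.
leverages = np.diag(H)
print(np.round(1 - leverages, 3))
```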

The heteroskedasticity in $e$ can be "corrected for"/"undone" by using (internally or externally) studentized residuals $\tilde{e}_{i,int}:=\frac{e_i}{\hat\sigma_{int}\sqrt{1-h_{ii}}}$ or $\tilde{e}_{i,ext}:=\frac{e_i}{\hat\sigma_{ext}\sqrt{1-h_{ii}}}$, where $h_{ii}$ is the $i$th diagonal element of $H$ and $\hat\sigma_{int}$ and $\hat\sigma_{ext}$ are internal and external estimates of the error standard deviation, respectively.
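As a concrete illustration, here is a small NumPy sketch of both variants on simulated data. The closed-form leave-one-out expression used for $\hat\sigma_{ext}$ is the standard one, stated here as an assumption rather than taken from the question:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)  # hypothetical true model

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y                                # raw residuals

# internal estimate uses all n residuals
sigma2_int = e @ e / (n - p)
e_int = e / np.sqrt(sigma2_int * (1 - h))    # internally studentized

# external estimate leaves observation i out (closed form, no refitting)
sigma2_ext = (e @ e - e**2 / (1 - h)) / (n - p - 1)
e_ext = e / np.sqrt(sigma2_ext * (1 - h))    # externally studentized
```

The externally studentized residuals follow a $t_{n-p-1}$ distribution exactly under the model assumptions, which is one reason they are often preferred for outlier tests.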

This heteroskedasticity correction seems to come at zero cost. No estimation or approximation error seems to be introduced this way, aside from a uniform (across data points) scaling due to imperfect estimates $\hat\sigma_{int}$ and $\hat\sigma_{ext}$, and the computational cost is low. This might suggest we should routinely use studentized residuals instead of raw residuals for regression diagnostics, because this way we remedy the heteroskedasticity of $e$ without really sacrificing anything.

Question: Are there any reasons for not using $\tilde{e}_{int}$ or $\tilde{e}_{ext}$ instead of $e$ as a substitute for the true unobserved $\varepsilon$ whenever doing model diagnostics regarding homoskedasticity, normality and other common assumptions/conditions?

Richard Hardy
  • Some threads on studentized residuals: [Raw residuals versus standardised residuals versus studentised residuals - what to use when?](https://stats.stackexchange.com/questions/22653/), [What's the difference between standardization and studentization?](https://stats.stackexchange.com/questions/99717/), [What advantages do “internally studentized residuals” offer over raw estimated residuals in terms of diagnosing potential influential datapoints?](https://stats.stackexchange.com/questions/44033/). – Richard Hardy May 03 '19 at 13:01
  • Also related: [Homoscedasticity Assumption in Linear Regression vs. Concept of Studentized Residuals](https://stats.stackexchange.com/questions/306735/). – Richard Hardy May 03 '19 at 13:52
  • This looks like an endemic problem in the statistical literature. I have encountered books and university class web pages that tell students that residuals are IID. http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html, and http://www.restore.ac.uk/srme/www/fac/soc/wie/research-new/srme/modules/mod2/6/index.html, and http://www-personal.umich.edu/~gonzo/coursenotes/file7.pdf. It may have already become a pandemic; that I don't know. – Cagdas Ozgenc Nov 19 '19 at 13:33
  • @CagdasOzgenc, thank you for your comment! Actually, despite years of econometric and statistical education, I was never taught this and eventually discovered this in a random textbook. The facts seem to have been buried too deep in the part of the literature that I was familiar with. Based on what you said, should I conclude that the answer to the question *Are there any reasons...* is a *No*? – Richard Hardy Nov 19 '19 at 13:49
  • To tell you the truth I don't understand why one will prefer to use a wrong distribution. But I have seen it done in so many places that I question myself. I am not an academic or an authority on the matter. It looks to me only externally studentized version is viable as an analytical solution with a t-distribution. What is your opinion? – Cagdas Ozgenc Nov 19 '19 at 13:55
  • @CagdasOzgenc, I thought about it a while ago and have already forgotten some of the context. I do not think I found a definite answer to begin with, which is why I raised the question. As you can see, I did not find any arguments myself for **not** using studentized residuals. (Not sure about internal vs. external standardization.) – Richard Hardy Nov 19 '19 at 14:59
  • A bounty has been wasted on this question. I will wait before offering another one, but if you happen to post an outstanding answer, I will consider awarding the bounty afterwards. – Richard Hardy Feb 12 '20 at 09:43
  • For what it is worth, I do not do regression diagnostics that way. For the types of problems I examine, I use residuals plotted against $x$-values ranks because to do otherwise does not show the trend for underestimation of larger values and overestimation of lesser values that occurs in linear and non-linear OLS in $y$ regression caused by unequal spacing between $x$-values if those distances are indeed unequal. For my problems this relates to surfeit extrapolation error beyond the $x$-value range. In other words, OLS regression is typically monovariate in $y$, and not bivariate in ($x,y$). – Carl Jun 30 '20 at 08:26

1 Answer


$H_{ii}$ is small for large $n$

The magnitude of the diagonal elements of the hat matrix $H$ decreases quickly as the number of observations grows, scaling as $1/n$. If the columns of $X$ are orthogonal, then

$$H_{ii} = \frac{X_{i1}^2}{\sum_{j=1}^n X_{j1}^2} + \frac{X_{i2}^2}{\sum_{j=1}^n X_{j2}^2} + \dots + \frac{X_{ip}^2}{\sum_{j=1}^n X_{jp}^2} $$

The mean of the diagonal equals $p/n$.* So the inhomogeneity due to the diagonal of the hat matrix, $H_{ii}$, is of order $\sim p/n$.


Diagnostic plots are often with large $n$

For a diagnostic plot one often has a large $n$ (because few points do not really show much of a pattern), and then the contribution of $H_{ii}$ to the variance of $e_i$ is small, so the variances of the $e_i$ are relatively homogeneous. Or at least the $H_{ii}$ do not contribute much to the inhomogeneity.
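A quick simulation illustrates this shrinkage. Assuming a simple hypothetical design with an intercept and one standard-normal regressor, the ratio of the largest to the smallest raw-residual standard deviation, $\sqrt{(1-h_{\min})/(1-h_{\max})}$, approaches 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (20, 200, 2000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
    # worst-case ratio of raw-residual standard deviations
    ratio = np.sqrt((1 - h.min()) / (1 - h.max()))
    print(n, round(ratio, 3))
```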

The effect of $H_{ii}$ is then negligible, and the reason not to use studentized residuals is simplicity.


*The trace of the projection matrix equals the rank of $X$, and since the rank is typically the number of columns $p$, we have $$\sum_{i=1}^n H_{ii} = p,$$ which means that the average $H_{ii}$ equals $p/n$.
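This trace identity can be verified directly on a simulated full-rank $X$ (an arbitrary example, chosen only for the check):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))                  # full rank with probability 1
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.trace(H))        # ≈ p, the rank of X
print(np.diag(H).mean())  # ≈ p / n
```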

Sextus Empiricus