Let $y_i-\hat y_i=\hat {\epsilon}_i$ be the residual of the linear regression, where $\hat y_i=x_i'\hat{\beta}$. Are the residuals random variables? My intuition says yes: $\hat {\epsilon}$ is an estimator of ${\epsilon}$ and, hence, a function of other random variables (specifically $X_i$ and $Y_i$ for $i=1,\dots,n$).

- Yes, it is. $\epsilon$ is obtained by using the algebra of random variables. – osmoc Aug 14 '20 at 13:06
- You misuse the term "estimator." By definition, an estimator is a function of a (presumably) random sample that is used to make a guess about a property of the underlying distribution. The $\epsilon$ are not distributional properties: they are random variables. – whuber Aug 14 '20 at 14:32
- If I define estimator as a rule for estimation, or, short of a definition, just use the term in that way, I am being vague and general. But what makes that definition wrong? – Nick Cox Aug 16 '20 at 10:58
- See also https://stats.stackexchange.com/questions/133389/what-is-the-difference-between-errors-and-residuals – kjetil b halvorsen Aug 16 '20 at 15:55
1 Answer
Let's say that your model is $$y=X\beta+\epsilon,\quad E[y]=X\beta,\quad \epsilon\sim N(0,\sigma^2 I).$$ You estimate the $\beta$ coefficients by $$\hat\beta=(X'X)^{-1}X'y$$ and you get $$\hat{y}=Hy,\quad H=X(X'X)^{-1}X'$$ where $H$ is a symmetric idempotent matrix, and $$\hat\epsilon=y-Hy=(I-H)y,\quad E[\hat\epsilon]=0,\quad \text{Cov}(\hat\epsilon)=(I-H)\sigma^2.$$ You can see that, while the errors are independent and homoscedastic, the residuals are neither independent ($I-H$ is not a diagonal matrix) nor homoscedastic (the diagonal elements of $I-H$ are not equal). Moreover, the residuals' variance and covariance depend on $H$, therefore on your data $X$.
The residual vector is a transformation of $\epsilon$: \begin{align*} \hat\epsilon &= (I-H)y=(I-H)X\beta+(I-H)\epsilon\\ &=[X-X(X'X)^{-1}(X'X)]\beta+(I-H)\epsilon\\ &=(I-H)\epsilon \end{align*} so it is a random variable, but is not an estimator of $\epsilon$.
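Here is a minimal numerical sketch of both facts (illustrative only, assuming the normal linear model above; the design and coefficient values are arbitrary): the residual vector equals $(I-H)\epsilon$ exactly, and its covariance $\sigma^2(I-H)$ is neither diagonal nor constant along the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix with intercept
beta, sigma = np.array([1.0, 2.0, -0.5]), 1.0

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: symmetric and idempotent
M = np.eye(n) - H                      # I - H; note M @ X = 0

eps = rng.normal(scale=sigma, size=n)  # the unobservable error term
y = X @ beta + eps
resid = y - H @ y                      # residuals: (I - H) y

print(np.allclose(resid, M @ eps))     # True: the residuals are exactly (I - H) eps
cov = sigma**2 * M                     # Cov(resid): nonzero off-diagonal (correlated),
print(np.round(cov[:3, :3], 3))        # unequal diagonal (1 - h_ii) sigma^2 (heteroscedastic)
```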
EDIT
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data. For example, if $X_1,\dots,X_n$ is a random sample, you can calculate the sample mean, i.e. the mean of observed realizations of $X_1,\dots,X_n$, to estimate $E[X]$.
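As a trivial sketch (assuming, say, i.i.d. draws from an exponential distribution with mean 3):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.exponential(scale=3.0, size=10_000)  # observed realizations of X_1, ..., X_n with E[X] = 3
print(sample.mean())  # the sample mean, an estimator of E[X]; prints a value close to 3
```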
Since the error term is unobserved and unobservable, the residuals are not and cannot be observed realizations of the error term, and $\hat\epsilon$ is not and cannot be an estimator of $\epsilon$ (I'm using your phrasing here; see whuber's enlightening comments).
However, since the residual random vector is a transformation of $\epsilon$ that depends on your model, you can use $\hat\epsilon$ as a proxy for the error term, where "proxy" means an observed variable that is used in place of an unobserved variable (clearly, proxy variables are not estimators).
If your residuals behave as you would expect from the error term, then you can hope that your model is 'good'. If residuals are 'strange', you do not think that you have estimated a 'true' strange error term: you think that your model is wrong. For example, the error term in your model is not a 'true' error term, but depends on missing transformations of predictors or outcome, or on omitted predictors (you can find several examples in Weisberg, Applied Linear Regression, chap. 8.)
Let me stress this point. You get some residuals; if you like them, you accept them; otherwise you change your model, i.e. you change $X$, therefore $H$, therefore $I-H$, therefore $(I-H)\epsilon$. If you don't like the residuals you get, then you change them. Rather a bizarre "estimator"! You keep it if you like it; otherwise you change it, and change it again, until you like it.
If you were sure that your model is the 'true' model, you could think of your residuals as (improper) estimators of the error term, but you'll never know that your model is 'true'. Thinking that the residuals estimate the errors is wishful thinking. IMHO, of course.
EDIT 2
We need an estimate of $\sigma^2$ to obtain an estimate of the covariance matrix of $\hat\beta$, and we actually use the residuals.
Let's recall that the residuals are not an estimator of the error term, because:
- an estimator is a function of observable random variables, and an estimate is a function of their observed realized values, but the error term is unobservable;
- the error term is a random variable, not a distributional property (see whuber's comments);
- the $\hat\epsilon$ random variable is a transformation of $\epsilon$, a transformation which depends on the model;
- if the model is correctly specified, the consistency of $\hat\beta$ implies that $\hat\epsilon\rightarrow\epsilon$ as $n\rightarrow\infty$, but the finite-sample properties of $\hat\epsilon$ always differ from those of $\epsilon$ (residuals are correlated and heteroscedastic).
Moreover, $\text{Var}(\hat\epsilon_i)=(1-h_{ii})\sigma^2$, where $h_{ii}$ is a diagonal element of $H$ and $1-h_{ii}<1$, so the variance of $\hat\epsilon_i$ is less than $\sigma^2$ for every $i$.
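A quick Monte Carlo check of this variance formula (a sketch under the model above; the design and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 20, 1.5, 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design, k = 2
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

eps = rng.normal(scale=sigma, size=(reps, n))  # many error vectors, one per row
resid = eps @ M                                # each row becomes (I - H) eps (M is symmetric)

emp_var = resid.var(axis=0)                    # empirical Var(resid_i) across replications
theo_var = (1 - np.diag(H)) * sigma**2         # (1 - h_ii) sigma^2, each below sigma^2
print(np.allclose(emp_var, theo_var, rtol=0.05))  # True, up to Monte Carlo error
```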
However, if the model is correctly specified, then we can use the method of moments to get a biased estimator of $\sigma^2$: $$\hat\sigma^2=\frac{1}{n}\sum_i\hat\epsilon_i^2,\quad E[\hat\sigma^2]=\frac{n-k}{n}\sigma^2$$ and the unbiased estimator is $$s^2=\frac{1}{n-k}\sum_i\hat\epsilon_i^2$$ where $k$ is the number of columns of $X$, the number of elements in $\beta$.
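The bias factor $(n-k)/n$ shows up clearly in simulation (again an illustrative sketch, assuming a correctly specified model):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma, reps = 30, 3, 2.0, 50_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -1.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

y = X @ beta + rng.normal(scale=sigma, size=(reps, n))  # one sample per row
resid = y - y @ H                                       # (I - H) y for every replication
rss = (resid**2).sum(axis=1)                            # sum of squared residuals

print(rss.mean() / n)        # ~ sigma^2 (n - k)/n = 4 * 27/30 = 3.6  (biased)
print(rss.mean() / (n - k))  # ~ sigma^2 = 4                          (unbiased)
```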
But correct specification is a very strong assumption. For example, if the model is overspecified, i.e. if we include irrelevant predictors, the variance of $\hat\beta$ will increase. If the model is underspecified, i.e. if we omit relevant predictors, $\hat\beta$ will generally be biased and inconsistent, and the covariance matrix of $\hat\beta$ will be incorrect (see Davidson & MacKinnon, Econometric Theory and Methods, chap. 3 for more details).
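A sketch of the underspecified case (illustrative numbers: the true model has two correlated predictors, but the fitted model omits one) shows the slope estimator drifting away from its true value:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, reps = 200, 1.0, 10_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)  # relevant predictor, correlated with x1
X_under = np.column_stack([np.ones(n), x1])    # underspecified design: x2 omitted

slopes = np.empty(reps)
for r in range(reps):
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(scale=sigma, size=n)
    slopes[r] = np.linalg.lstsq(X_under, y, rcond=None)[0][1]

# omitted-variable bias: E[slope] = 2 + 1.5 * (sample regression of x2 on x1) ~ 2 + 1.5 * 0.7
print(slopes.mean())  # ~ 3.05, not the true coefficient 2
```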
Therefore, we can't use residuals as proper estimators of the error term or of its distributional properties. First, we use residuals to "estimate" (loosely speaking) the "goodness" of our model, and possibly to change it; then we use residuals as a transformation of the error term, as observable quantities in place of unobservable realizations of the error term, hoping that the transformation is "good enough" and that we can indirectly get a reasonable estimate of $\sigma^2$.

- You were going fine until the last sentence. If the residuals don't estimate the errors, there is no point to them. – Nick Cox Aug 14 '20 at 08:21
- You can use observed $\hat\epsilon$ as a proxy for unobserved $\epsilon$. – Sergio Aug 14 '20 at 08:40
- Is a proxy different from an estimator? You might wish to explain that in more detail. I see $\hat \epsilon$ just as a projection of $\epsilon$ onto the space orthogonal to the column space of $X$, and an estimator for $\epsilon$ (at least, it is used as an estimator for the parameters of its distribution $\epsilon \sim N(0,\sigma^2 I)$, but I don't see why it would not work for $\epsilon$ as well). – Sextus Empiricus Aug 14 '20 at 08:51
- Why? Because it can work if you are sure that your model is the 'true' model, but you never know. – Sergio Aug 14 '20 at 09:03
- @Sergio On that logic I can't see that anything qualifies for you as an estimator, even a lousy one. – Nick Cox Aug 14 '20 at 09:43
- I'm with @NickCox. Even $6$ can be an estimator (might even be admissible). – Dave Aug 14 '20 at 12:53
- Responding to the revised answer: There are issues here on various levels. (1) You can use terminology with your own idiosyncratic senses, but that can raise problems in communication in making yourself understood. (2) There are subtleties enough here for there to be no need to add further nuances. If I am estimating a mean from a sample, the mean being estimated is every bit as unobserved and unobservable as the deviations from it. Incidentally, it is possible to talk of estimation without defining estimators: R.A. Fisher thought the term quite unnecessary. – Nick Cox Aug 14 '20 at 13:07
- I've often found the term _proxy_ useful in talking about measurement. For example, length of mercury thread in a glass tube can be a good proxy for temperature within a specific range. That doesn't stop the term having other useful applications, but I don't find that true here. – Nick Cox Aug 14 '20 at 13:10
- Re "$\hat\epsilon$ is not and cannot be an estimator of $\epsilon$": I think the underlying idea is correct and well thought out, but this is a confusing way of expressing it. $\epsilon$ cannot be the target of any estimator because it is not a distributional property. This fact has nothing to do with $\hat\epsilon$. Consider, therefore, removing the reference to $\hat\epsilon$ from this statement. – whuber Aug 14 '20 at 14:34
- The term "BLUP" stands for "Best Linear Unbiased Predictor." It is used in random effects models to "estimate" the random effects. Because of problems with the term "estimation" as noted in the comments, the word "prediction" is used instead. The same could be done here: $\hat \epsilon$ can be said to be a *predictor* of $\epsilon$. – BigBendRegion Aug 14 '20 at 22:24
- I would never deny the role of rigorous, formal theory but also am not a person to attempt its style. Estimation I regard as (ideally smart) guessing at unknown quantities, and errors are unknown quantities which residuals estimate. A different sample, different residuals. A different model, different residuals. A different definition of residuals, different residuals. What else is new? Then again, as an applied person, I am happy to think about sampling distributions of (e.g.) scatter plots, which some precisians want to regard as an abuse of terminology. – Nick Cox Aug 15 '20 at 11:46
- +1 Great answer. Shouldn't this part "the residuals are neither **independent** ($I-H$ is not a diagonal matrix)" say "the residuals are neither **uncorrelated** ($I-H$ is not a diagonal matrix)" instead? – ColorStatistics Feb 18 '21 at 15:44
- @ColorStatistics Since $\hat\epsilon\sim N(0,\sigma^2(I-H))$, the residuals are independent if and only if they are uncorrelated. – Sergio Feb 18 '21 at 18:31
- I see, so you're saying that we know that the residuals are jointly normally distributed and correlated, because $I-H$ is not a diagonal matrix, hence the residuals are not independent. Did I get that right? – ColorStatistics Feb 18 '21 at 18:38
- @ColorStatistics We *assume* that the residuals are jointly normally distributed :) – Sergio Feb 18 '21 at 19:37
- @Sergio: I replaced my old comment with this one: We do not assume anything about the residuals, do we? We derive/infer the distribution of the residuals from the fact that the residuals are functions of the errors, whose distribution we've assumed. Let me know if you disagree. – ColorStatistics Feb 19 '21 at 19:19