From your question and your comments, I have the impression that you are confusing different things. Let $Y$ be your response random variable and $\mathbf{X}$ your random vector of predictors. Let's do things in order:
What is the first property that we should check, for the validity of the linear model?
Suppose we want to estimate $Y$ given $\mathbf{X}$. We may write:
$$Y=\mathbf{X}\boldsymbol{\beta}+\epsilon$$
where for now we have simply defined $\epsilon$ as equal to $Y-\mathbf{X}\boldsymbol{\beta}$ for some parameter vector $\boldsymbol{\beta}$. Such a definition is always possible, whether the linear model is correct or not. However, if there is a fixed (non-random) but unknown parameter vector $\boldsymbol{\beta}$ such that $\mathbb{E}[Y|\mathbf{X}]=\mathbf{X}\boldsymbol{\beta}$, then it is immediate to prove that $\mathbb{E}[\epsilon|\mathbf{X}]=\mathbb{E}[Y-\mathbf{X}\boldsymbol{\beta}|\mathbf{X}]=0$, and from this, using the law of iterated expectations, that $\mathbb{E}[\epsilon]=0$. Conversely, if we only know that $\mathbb{E}[\epsilon]=0$, this does not imply that the conditional mean of $Y$ given $\mathbf{X}$ is a linear function of the predictors, which is the real, most basic property of a linear model. Thus the assumption whose plausibility we should check is $\mathbb{E}[\epsilon|\mathbf{X}]=0$, from which $\mathbb{E}[\epsilon]=0$ follows. For this reason, in the following I will take $\mathbb{E}[Y|\mathbf{X}]=\mathbf{X}\boldsymbol{\beta}$ as the definition of the linear model, even though we usually add other properties to the definition (homoskedastic, iid, Gaussian errors).
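Written out, the iterated-expectations step is simply
$$\mathbb{E}[\epsilon]=\mathbb{E}\big[\mathbb{E}[\epsilon\mid\mathbf{X}]\big]=\mathbb{E}[0]=0.$$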
How do we check that $\mathbb{E}[\epsilon|\mathbf{X}]=0$ is (approximately) valid?
Denote with $\mathbf{x}_i$ the $i$-th realization of the random vector $\mathbf{X}$. Even if we draw a random sample of size $N$ from the joint distribution of $Y$ and $\mathbf{X}$, i.e., $S=\{(y_i,\mathbf{x}_i)\}_{i=1}^N$, we do not also get a random sample of $\epsilon$, because $\boldsymbol{\beta}$ is unknown. For example, if I want to compute $\epsilon_1$ corresponding to $(y_1,\mathbf{x}_1)$, I can't, because in the equation $\epsilon_1=y_1-\mathbf{x}_1\boldsymbol{\beta}$ I don't know $\boldsymbol{\beta}$. However, given $S$ I can use the OLS estimator to get an estimate $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$. I then have the $N$ quantities $e_1=y_1-\mathbf{x}_1\hat{\boldsymbol{\beta}},\dots,e_N=y_N-\mathbf{x}_N\hat{\boldsymbol{\beta}}$. These quantities, called residuals, are different from the (unknown) quantities $\epsilon_1=y_1-\mathbf{x}_1\boldsymbol{\beta},\dots,\epsilon_N=y_N-\mathbf{x}_N\boldsymbol{\beta}$. They are realizations of the random variable
$$E=Y-\mathbf{X}\hat{\boldsymbol{\beta}}$$
not realizations of $\epsilon$. However, since they are the only quantities we have access to, and since they can be seen as estimates of the errors, we can use the residuals vs fitted plot to check whether the assumption $\mathbb{E}[\epsilon|\mathbf{X}]=0$ is at least plausible. Strictly speaking, we should plot the residuals against the predictors, not against the fitted values, but such a plot is impossible to visualize if we have more than 2 predictors. Since the fitted value $\hat{Y}=\mathbf{X}\hat{\boldsymbol{\beta}}$ is a function of the predictors, if $\mathbb{E}[\epsilon|\mathbf{X}]=0$ then, at least to a good approximation, $\mathbb{E}[\epsilon|\hat{Y}]=0$ and $\mathbb{E}[E|\hat{Y}]=0$ as well, so the residuals should show no systematic trend around 0 when plotted against the fitted values. For this reason, as a workaround we look at the residuals vs fitted plot, which is always a 2D plot.
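As a concrete illustration (a minimal R sketch with simulated data; the object names x0, y0, fit0 are just for this example), the residuals returned by lm() are exactly $y_i-\mathbf{x}_i\hat{\boldsymbol{\beta}}$ computed with the OLS estimate, and they are not the unobservable errors $\epsilon_i$:
set.seed(1)
n0 <- 200
x0 <- runif(n0, 0, 10)
eps0 <- rnorm(n0)                                # true errors: never observed in practice
y0 <- 1 + 2*x0 + eps0                            # true (unknown) beta = (1, 2)
fit0 <- lm(y0 ~ x0)
X0 <- cbind(1, x0)                               # design matrix with intercept
beta_hat <- solve(t(X0) %*% X0, t(X0) %*% y0)    # OLS estimate computed by hand
e0 <- as.numeric(y0 - X0 %*% beta_hat)           # residuals computed by hand
isTRUE(all.equal(e0, unname(residuals(fit0))))   # TRUE: these are the residuals lm() returns
isTRUE(all.equal(e0, eps0))                      # FALSE: the residuals are not the errors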
NOTE: it's true that, by the properties of the OLS estimator, $\sum_i e_i=0$ for any sample size (as long as the model includes an intercept), irrespective of whether the linear model is true or not, i.e., whether $\mathbb{E}[Y|\mathbf{X}]=\mathbf{X}\boldsymbol{\beta}$ or not. Applying a Law of Large Numbers argument, this also implies that $\mathbb{E}[E]=0$, even if $\mathbb{E}[Y|\mathbf{X}]\neq\mathbf{X}\boldsymbol{\beta}$, but it does not imply $\mathbb{E}[E|\mathbf{X}]=0$, unless we also know that $\mathbb{E}[Y|\mathbf{X}]=\mathbf{X}\boldsymbol{\beta}$. Let's show this with two examples.
An example where the linear model is valid
Let $Y=2X+\epsilon$ with $X$ and $\epsilon$ independent, and $\mathbb{E}[\epsilon]=0$. Then $\mathbb{E}[\epsilon|X]=\mathbb{E}[\epsilon]=0$ and the linear model is valid, i.e., $\mathbb{E}[Y|X]=2X$. Let's get a sample of size $N=1000$:
N <- 1000
x <- runif(N,0,10)
epsilon <- rnorm(N,0,0.1)
y <- 2*x + epsilon
fit <- lm(y~x)
We can check that $\sum_{i=1}^N e_i=0$ (up to floating-point error):
sum(residuals(fit))
#[1] -2.831936e-15
and also that in the residuals vs fitted plot the nonparametric (loess) estimate of the conditional mean of the residuals is very close to 0.
plot(fit, which = 1)   # residuals vs fitted
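Equivalently, here is a minimal sketch of the same plot drawn by hand, with a lowess smooth of the residuals against the fitted values:
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
lines(lowess(fitted(fit), residuals(fit)), col = "red", lwd = 2)   # smooth stays close to 0
abline(h = 0, lty = 2)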

An example where the linear model is not valid (or is it?)
Let $Y=\sin(X)+U$ with $X$ and $U$ independent, and $\mathbb{E}[U]=0$. Again, it's true that $\mathbb{E}[U|X]=\mathbb{E}[U]=0$. However, this time there are no $\beta_0,\beta_1$ such that $\mathbb{E}[Y|X]=\beta_0+\beta_1X$. Let's get a sample of size $N=1000$:
N <- 1000
x <- runif(N, 0, 10)
u <- rnorm(N, 0, 0.1)
y <- sin(x) + u
fit <- lm(y~x)
As before, we must have $\sum_{i=1}^N e_i=0$:
sum(residuals(fit))
#[1] 2.20414e-14
However, the conditional mean of the residuals is clearly not constant and equal to 0:

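A quick numerical check (a small sketch, not part of the original code) makes the same point: averaging the residuals within bins of $x$ gives means that are clearly far from 0, so the conditional mean of the residuals is not constant at 0.
tapply(residuals(fit), cut(x, breaks = 5), mean)   # bin means far from 0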
Thus the linear model is not valid, even though $\sum_{i=1}^N e_i=0$. QED. By the way, note that in this particular case we could still get a valid linear model if we added $Z=\sin(X)$ to our list of predictors. This shows the strength of the linear model: by choosing the right basis functions, we can model a lot of different problems.
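For instance, here is a quick sketch continuing the simulation above (the object fit2 is just for illustration): with $\sin(X)$ as a predictor the model is still linear in the parameters, and the residuals now behave as the linear model requires.
fit2 <- lm(y ~ sin(x))                              # linear in the parameters, sin(x) as a basis function
summary(fit2)$coefficients                          # estimated coefficient of sin(x) close to 1
tapply(residuals(fit2), cut(x, breaks = 5), mean)   # bin means now all close to 0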