
The formula for linear regression is as follows:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

Please correct me if the above is wrong.

However, from various posts and notes, I've also read that the residuals of a linear regression (with an intercept term) always sum to zero. Therefore, by definition, the residuals are NOT iid. How can we have $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ independently and have the terms sum to zero at the same time?

I know I am making an incorrect statement somewhere, just not sure where. Thanks.

  • See also https://stats.stackexchange.com/questions/72392/is-the-residual-e-an-estimator-of-the-error-epsilon https://stats.stackexchange.com/questions/193262/definition-of-residuals-versus-prediction-errors https://stats.stackexchange.com/questions/462588/question-about-regression-error-and-the-residual-maker-matrix – kjetil b halvorsen Nov 12 '20 at 10:04

1 Answer


I think you are confusing residuals and errors. Residuals, often denoted $\hat{\varepsilon}_i$ or $e_i$, are $$\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$ whereas errors are $$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$$ The small (but critical!) difference is the hat on the betas: residuals are computed from the *estimated* coefficients, which is why they are often written with a hat — they are estimates of the errors. The residuals are not independent, since they sum to zero (when the model includes an intercept), but the errors are independent by assumption of the model.
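The distinction is easy to check numerically. Here is a minimal sketch (assuming NumPy; the true line $y = 3 + 5x$ and the noise level are illustrative choices, not taken from the answer): the fitted residuals sum to zero essentially exactly, while the simulated errors do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from a known model: y = 3 + 5x + eps, with eps ~ N(0, 1)
n = 100
x = rng.uniform(-10, 10, n)
eps = rng.normal(0.0, 1.0, n)          # the true errors
y = 3.0 + 5.0 * x + eps

# OLS fit with an intercept column (ordinary least squares)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat           # estimates of the errors

print(residuals.sum())   # ~0: forced by the intercept in the fit
print(eps.sum())         # some nonzero value: errors need not sum to zero
```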

Pohoua
  • May I ask a few follow-up questions? I am having trouble understanding the concept of "estimating" the parameters. Can I propose an "experiment" of sorts? Let's say the true regression line is $Y = 3 + 5X$, and that we uniformly generate some arbitrarily large number of $X$ values between -10 and 10. We take each $x_i$, apply the true regression parameters, and then add a random normally distributed error term $\epsilon_i$. (a) Would this still be considered a sample regression model rather than a population regression model? – Harshil Garg Nov 11 '20 at 16:58
  • (b) If we did linear regression on this, we would not get exactly 3 and 5 for the betas, but how would the estimates be related to 3 and 5? Would they be normally distributed around 3 and 5? – Harshil Garg Nov 11 '20 at 17:01
  • Your estimates would be "close to" 3 and 5, and if you kept your $x_i$s fixed but resimulated the errors several times in order to get several estimates, you would end up with a normal distribution centred around the true parameters. Concerning your first comment, I do not know the difference between 'sample regression' and 'population regression' models (I have actually never heard these terms). – Pohoua Nov 12 '20 at 16:51
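The experiment proposed in the comments can be sketched as follows (a hedged illustration using NumPy; the true line $y = 3 + 5x$, the range of $X$, and the noise level $\sigma = 1$ follow the comment, while the sample size and number of repetitions are arbitrary): keeping the $x_i$ fixed and resimulating the errors many times gives a cloud of estimates centred near the true $(3, 5)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 200, 2000
x = rng.uniform(-10, 10, n)            # fixed design points, as in the comment
X = np.column_stack([np.ones(n), x])

# Refit the same model many times, resimulating only the errors each time
estimates = np.empty((n_sims, 2))
for i in range(n_sims):
    y = 3.0 + 5.0 * x + rng.normal(0.0, 1.0, n)
    estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))  # close to the true parameters (3, 5)
```

A histogram of either column of `estimates` would look approximately normal, which is the distribution of the OLS estimator described in the last comment.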