
The formula for linear regression is as follows:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

Please correct me if the above is wrong.

However, from various posts and notes, I've also read that the residuals of a linear regression (with an intercept term) always sum to zero. Therefore, by definition, the residuals are NOT iid. How can we have $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ independently and have the terms sum to zero at the same time?

I know I am making an incorrect statement somewhere, just not sure where. Thanks.

  • See also https://stats.stackexchange.com/questions/72392/is-the-residual-e-an-estimator-of-the-error-epsilon https://stats.stackexchange.com/questions/193262/definition-of-residuals-versus-prediction-errors https://stats.stackexchange.com/questions/462588/question-about-regression-error-and-the-residual-maker-matrix – kjetil b halvorsen Nov 12 '20 at 10:04

1 Answer


I think you are confusing residuals and errors. Residuals, often denoted $\hat{\varepsilon}_i$ or $e_i$, are $$\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$ whereas errors are $$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$$ The small (but critical!) difference is the hat on the betas: residuals are computed from the *estimated* coefficients, which is why they are often written with a hat — they are estimates of the errors. The residuals are not independent, since they sum to zero (when the model includes an intercept), but the errors are independent by assumption of the model.
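The distinction is easy to check numerically. Here is a minimal sketch (assuming NumPy; the true line $y = 3 + 5x$ and the noise level are illustrative choices, not taken from the answer): the fitted residuals sum to zero essentially exactly, while the simulated errors do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from a known model: y = 3 + 5x + eps, with eps ~ N(0, 1)
n = 100
x = rng.uniform(-10, 10, n)
eps = rng.normal(0.0, 1.0, n)          # the true errors
y = 3.0 + 5.0 * x + eps

# OLS fit with an intercept column (ordinary least squares)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat           # estimates of the errors

print(residuals.sum())   # ~0: forced by the intercept in the fit
print(eps.sum())         # some nonzero value: errors need not sum to zero
```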

Pohoua
  • May I ask a few follow-up questions? I am having trouble understanding the concept of "estimating" the parameters. Can I propose an "experiment" of sorts? Let's say the true regression line is $Y = 3 + 5X$, and that we uniformly generate some arbitrarily large number of $X$ values between -10 and 10. We take each $x_i$, apply the true regression parameters, and then add a random normally distributed error term $\epsilon_i$. (a) Would this still be considered a sample regression model rather than a population regression model? – Harshil Garg Nov 11 '20 at 16:58
  • (b) If we did linear regression on this, we would not get exactly 3 and 5 for the betas, but how would the estimates be related to 3 and 5? Would they be normally distributed around 3 and 5? – Harshil Garg Nov 11 '20 at 17:01
  • Your estimates would be "close to" 3 and 5, and if you kept your $x_i$s fixed but resimulated the errors several times in order to get several estimates, you would end up with a normal distribution centred around the true parameters. Concerning your first comment, I do not know the difference between 'sample regression' and 'population regression' models (I have actually never heard these terms). – Pohoua Nov 12 '20 at 16:51
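The experiment proposed in the comments can be sketched as follows (a hedged illustration using NumPy; the true line $y = 3 + 5x$, the range of $X$, and the noise level $\sigma = 1$ follow the comment, while the sample size and number of repetitions are arbitrary): keeping the $x_i$ fixed and resimulating the errors many times gives a cloud of estimates centred near the true $(3, 5)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 200, 2000
x = rng.uniform(-10, 10, n)            # fixed design points, as in the comment
X = np.column_stack([np.ones(n), x])

# Refit the same model many times, resimulating only the errors each time
estimates = np.empty((n_sims, 2))
for i in range(n_sims):
    y = 3.0 + 5.0 * x + rng.normal(0.0, 1.0, n)
    estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))  # close to the true parameters (3, 5)
```

A histogram of either column of `estimates` would look approximately normal, which is the distribution of the OLS estimator described in the last comment.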