I know this might be a very basic question for anyone, but I'm not sure how to answer it correctly. It was recently asked at an interview. It would be great if someone could help me with answering this.
-
Need much more information than what you have given to provide a concrete answer, but I have written up a solution that should help you gain intuition. – Greenparker Mar 10 '16 at 03:11
-
If you didn't start with "Under what assumptions?" you would likely have been wrong anyway – Glen_b Mar 10 '16 at 03:12
1 Answer
Assuming that the data are an independent sample from a normal distribution, $$ X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2).$$
We obtain an estimate of $\mu$, $$\bar{X}_n = \dfrac{1}{n} \sum_{i=1}^{n} X_i. $$
The true errors are then $X_i - \mu$ and the estimated errors are $X_i - \bar{X}_n$. So the sum of squared (true) errors is $\sum_{i=1}^{n} (X_i - \mu)^2$.
Note that each $$\dfrac{X_i - \mu}{\sigma} \sim N(0,1) $$ and so $$\dfrac{1}{\sigma^2} \sum_{i=1}^{n}(X_i - \mu)^2 \sim \chi^2_n. $$
Thus, the sum of squared errors is distributed as a $\chi^2_n$ scaled by $\sigma^2$.
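As a quick sanity check of this claim, here is a minimal simulation sketch using only the Python standard library. The values of `n`, `mu`, `sigma`, and `reps`, as well as the helper name `scaled_true_sse`, are arbitrary choices for illustration and not part of the question. A $\chi^2_n$ variable has mean $n$ and variance $2n$, so the simulated moments should land close to those values.

```python
import random
import statistics

def scaled_true_sse(n, mu, sigma, rng):
    """Draw X_1..X_n ~ N(mu, sigma^2) and return sum((X_i - mu)^2) / sigma^2."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    return sum((x - mu) ** 2 for x in xs) / sigma ** 2

# Illustrative settings (assumptions, not taken from the question).
rng = random.Random(0)
n, mu, sigma, reps = 5, 2.0, 3.0, 20000

draws = [scaled_true_sse(n, mu, sigma, rng) for _ in range(reps)]

# Compare simulated moments with the chi-square(n) values: mean n, variance 2n.
sim_mean = statistics.fmean(draws)
sim_var = statistics.variance(draws)
print(f"simulated mean     {sim_mean:.3f}  (theory: {n})")
print(f"simulated variance {sim_var:.3f}  (theory: {2 * n})")
```

This only checks the first two moments, so it is a consistency check rather than a proof. Note that it uses the true $\mu$, which is available in a simulation but not in practice.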
However, as mentioned in the comments, the true sum of squared errors cannot be computed from the data since $\mu$ is unknown, and so it is estimated with $\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$. The distribution of this quantity is harder to find, since $\bar{X}_n$ is computed from the whole sample and depends on every $X_i$; the terms $X_i - \bar{X}_n$ are therefore not independent, and the same steps as above do not work. Intuitively, you lose one degree of freedom in estimating $\mu$, and $$\dfrac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \sim \chi^2_{n-1}. $$
On how to get to this point, some details can be found here: Distribution of sum of squares error for linear regression?.
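For intuition, here is a rough outline of the usual argument under the same normality assumption. It is only a sketch and leaves the independence step unproved. Adding and subtracting $\bar{X}_n$ inside each square gives $$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 + n(\bar{X}_n - \mu)^2, $$ because the cross term $2(\bar{X}_n - \mu)\sum_{i=1}^{n}(X_i - \bar{X}_n)$ is zero. Dividing by $\sigma^2$, the left side is $\chi^2_n$, and the last term is $$\left(\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi^2_1. $$ For a normal sample, $\bar{X}_n$ is independent of the deviations $X_i - \bar{X}_n$, so the two terms on the right are independent. Comparing moment generating functions (or applying Cochran's theorem) then leaves $\chi^2_{n-1}$ for the first term.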

-
It's probably worth mentioning the distribution of $\sum_{i=1}^{n}(X_i - \bar{X})^2$ too. – dsaxton Mar 10 '16 at 03:25
-
Sorry about deleting that when you were replying. I realized you had simply taken the question differently than I had and withdrew the complaint. But now if you take it to be the sum of squares of unobservable errors, why mention the estimated errors? There's no simple, obvious path from the easy part you did to the other case. Someone who can't do what you did won't make the leaps required to what you didn't do. I have no big issue if you take the question to be purely about $X_i-\mu$ or about $X_i-\bar{X}$ but it's not really the case that showing the first makes the second easy. – Glen_b Mar 10 '16 at 03:25
-
(ctd) ... If you still want your answer to cover the other case you'd probably at least need to outline how it's done, or at least take dsaxton's suggestion and mention what the distribution is. – Glen_b Mar 10 '16 at 03:32