11

On page 19 of the textbook Introduction to Statistical Learning (by James, Witten, Hastie, and Tibshirani; it is freely downloadable on the web, and very good), the following is stated:

Consider a given estimate $\hat{Y} = \hat{f}(X)$. Assume for a moment that both $\hat{f}$ and $X$ are fixed. Then, it is easy to show that:

$$\mathrm{E}(Y - \hat{Y})^2 = \mathrm{E}[f(X) + \epsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)$$

It is further explained that the first term represents the reducible error, and the second term represents the irreducible error.

I am not fully understanding how the authors arrive at this answer. I worked through the calculations as follows:

$$\mathrm{E}(Y - \hat{Y})^2 = \mathrm{E}[f(X) + \epsilon - \hat{f}(X)]^2$$

This simplifies to $[f(X) - \hat{f}(X) + \mathrm{E}[\epsilon]]^2 = [f(X) - \hat{f}(X)]^2$, assuming that $\mathrm{E}[\epsilon] = 0$. Where is the $\mathrm{Var}(\epsilon)$ indicated in the text coming from?
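To convince myself that the identity at least holds numerically, I ran a quick Monte Carlo check in Python (just a sketch; the particular $f$, $\hat{f}$, $x$, and noise level below are made-up choices):

```python
import numpy as np

# Monte Carlo check of E[(Y - Yhat)^2] = [f(X) - fhat(X)]^2 + Var(eps)
# at a single fixed X. All concrete choices below are arbitrary.
rng = np.random.default_rng(0)

x = 2.0                        # X held fixed
f = lambda t: 3.0 * t + 1.0    # hypothetical "true" f
f_hat = lambda t: 3.2 * t      # hypothetical fixed estimate f_hat
sigma = 0.5                    # sd of the irreducible error epsilon

eps = rng.normal(0.0, sigma, size=1_000_000)
y = f(x) + eps                 # Y = f(X) + epsilon
y_hat = f_hat(x)               # Yhat = fhat(X), a constant here

lhs = np.mean((y - y_hat) ** 2)            # estimates E[(Y - Yhat)^2]
rhs = (f(x) - f_hat(x)) ** 2 + sigma ** 2  # [f(X) - fhat(X)]^2 + Var(eps)
print(lhs, rhs)  # both close to 0.61 = 0.6^2 + 0.5^2
```

The two numbers agree to Monte Carlo error, so the result seems right; I just don't see the algebra.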

Any suggestions would be greatly appreciated.

wellington
  • Because this is from a textbook, you should add the `self-study` tag to your question. See http://stats.stackexchange.com/tags/self-study/info – Patrick Coulombe Jul 31 '14 at 20:05
  • Your notation is mystifying because $\mathrm{E}(Y - \hat{Y})^2 = \mathrm{E}[f(X) + \epsilon - \hat{f}(X)]^2$ literally means the square of the expectation. Assuming $\mathrm{E}(\epsilon)=0$, this immediately reduces to $(f(X)-\hat{f}(X)+\mathrm{E}(\epsilon))^2 = (f(X)-\hat{f}(X))^2$. Evidently, then, what you really want to compute is the expectation of the square, $\mathrm{E}[(f(X)-\hat{f}(X)+\epsilon)^2]$. But if so, the very first step in your derivation makes no sense. Could you edit the question to clear this up? – whuber Jul 31 '14 at 20:30
  • Hmm.. I see what you mean. I didn't see that simplification at first (i.e. that $E[f(X)+\epsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X) + E(\epsilon)]^2 = [f(X) - \hat{f}(X)]^2$). But that further adds to my confusion about how we get $[f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)$ as the answer. Where is the $\mathrm{Var}(\epsilon)$ coming from? I will edit the question to reflect this clarification. – wellington Jul 31 '14 at 20:42
  • I was not pointing to a simplification, but to a *distinction*: the expectation of the square does not equal the square of the expectation. Even after the edits your question does not seem to recognize this crucial fact. – whuber Aug 01 '14 at 01:01
  • The issue that I was having was the notation in the book. The way I was initially thinking of the problem, I was approaching it as $\mathrm{E}[(Y - \hat{Y})^2] = (\mathrm{E}[f(X) + \epsilon - \hat{f}(X)])^2$, i.e. the quantity squared. What I later learned was that the book was trying to imply that $\mathrm{E}[f(X) + \epsilon - \hat{f}(X)]^2$ actually means $\mathrm{E}([f(X) + \epsilon - \hat{f}(X)]^2)$. I personally think this notation is a bit confusing, but it's how it's written in the text. I agree that it's important to remember that $\mathrm{E}[X^2] \neq \mathrm{E}[X]^2$. – wellington Aug 01 '14 at 01:08
  • I thought I was the only one struggling with whether the authors meant "expectation of the square" or "square of the expectation". I still am not sure. And I think this question as stated continues to use the original (ambiguous/unclear) notation... which it should. I will look to the answers for clarity on what the authors meant. – The Red Pea Aug 10 '16 at 00:57

2 Answers

7

Simply expand the square ...

$$[f(X)- \hat{f}(X) + \epsilon ]^2=[f(X)- \hat{f}(X)]^2 +2 [f(X)- \hat{f}(X)]\epsilon+ \epsilon^2$$

... and use linearity of expectations:

$$\mathrm{E}[f(X)- \hat{f}(X) + \epsilon ]^2=\mathrm{E}[f(X)- \hat{f}(X)]^2 +2\,\mathrm{E}[(f(X)- \hat{f}(X))\epsilon]+ \mathrm{E}[\epsilon^2]$$

Can you do it from there? (What things remain to be shown?)

Hint in response to comments: Show $\mathrm{E}(\epsilon^2)=\mathrm{Var}(\epsilon)$
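If a numerical sanity check helps, here is a small sketch of those two remaining pieces (the constant `c`, standing in for the fixed quantity $f(X)-\hat{f}(X)$, and the noise scale are arbitrary choices):

```python
import numpy as np

# Check that the cross term vanishes and that E[eps^2] = Var(eps)
# when E[eps] = 0. The constants here are arbitrary.
rng = np.random.default_rng(1)

c = 0.5        # stands in for the fixed quantity f(X) - f_hat(X)
sigma = 2.0    # sd of eps, so Var(eps) = 4.0
eps = rng.normal(0.0, sigma, size=1_000_000)

print(np.mean(2 * c * eps))           # cross term 2 E[(f - fhat) eps]: ~ 0
print(np.mean(eps ** 2), sigma ** 2)  # E[eps^2] ~ Var(eps) = 4.0
```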

Glen_b
  • I actually was able to get that far in the time I've been trying at this problem since. One of the confusions that I had the first time around was that I was treating the entire term, $\mathrm{E}[...]$, to be squared, rather than just squaring the inside, i.e. $\mathrm{E}([...]^2)$. I understand why $\mathrm{E}[f(X) - \hat{f}(X)]^2$ becomes $[f(X) - \hat{f}(X)]^2$, since it is just a number, and the expected value of a real number is just the number. What I don't understand is how $2\mathrm{E}[(f(X)-\hat{f}(X))\epsilon] + \mathrm{E}[\epsilon^2]$ becomes $\mathrm{Var}(\epsilon)$... – wellington Aug 01 '14 at 00:16
  • see my additional hint. What now remains to be shown? – Glen_b Aug 01 '14 at 00:30
  • Well we know that $\mathrm{E}(\epsilon^2) = \mathrm{Var}(\epsilon) + \mathrm{E}[\epsilon]^2$. The only thing I can think of is that we now apply the assumption that $\mathrm{E}[\epsilon] = 0$, therefore $(\mathrm{E}[\epsilon])^2 = 0$. Am I on the right track? – wellington Aug 01 '14 at 00:35
  • Yes, that's it. So what's left? And what's assumed about those quantities? – Glen_b Aug 01 '14 at 00:35
  • Well since $[f(X) - \hat{f}(X)]$ is just a constant, we can also factor it out of the second term, i.e. make it $2[f(X) - \hat{f}(X)]\mathrm{E}[\epsilon]$, and since $\mathrm{E}[\epsilon] = 0$, the middle term becomes zero. Then the final term becomes $\mathrm{Var}(\epsilon) + (\mathrm{E}[\epsilon])^2$, which is the same as simply $\mathrm{Var}(\epsilon)$. Therefore the final result would be $[f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)$. Ahh...kicking myself! I was seriously overthinking it...thanks so much for the help! – wellington Aug 01 '14 at 00:52
  • I fixed a few typos in my mathematics. It looks like you're set now. Similar "expand the square, use linearity of expectation and simplify" approaches work on a variety of related problems, even under somewhat different assumptions. – Glen_b Aug 01 '14 at 00:59
  • @Glen_b: Why is it that $[f(X) - \hat{f}(X)]$ is a constant? Isn't it possible that $\hat{f}$ could differ from $f$ by varying amounts depending upon what value we are considering in their domain? – George Apr 11 '16 at 13:48
  • @George see the conditions in the question which tell us we're at a fixed value of $X$. – Glen_b Apr 11 '16 at 17:03
  • It's still unclear to me why $Var(\epsilon) = E[\epsilon]^2$. Expanding the definition of $Var$, I can see how $Var(\epsilon) = E[\epsilon^2]$, but why is it also equal to $E[\epsilon]^2$? – George Apr 12 '16 at 04:45
  • @George It isn't – Glen_b Apr 12 '16 at 04:50
0

$$\mathrm{E}[(Y-\hat{Y})^2] = \mathrm{E}[(f(X)+\epsilon-\hat{f}(X))^2] = \mathrm{E}[(f(X)-\hat{f}(X))^2 + \epsilon^2 + 2\epsilon(f(X)-\hat{f}(X))]$$
$$= \mathrm{E}[(f(X)-\hat{f}(X))^2] + \mathrm{E}[\epsilon^2] + 2(f(X)-\hat{f}(X))\,\mathrm{E}[\epsilon] \tag{1}$$

The last term is zero because the expected value of the irreducible error is zero. Now let's see where the variance comes from. In general,

$$\mathrm{Var}(X) = \mathrm{E}[(X-\bar{X})^2] = \mathrm{E}[X^2 - 2X\bar{X} + \bar{X}^2] = \mathrm{E}[X^2] - 2\bar{X}\,\mathrm{E}[X] + \bar{X}^2$$

since the mean of $X$ is a constant, and so is the square of the mean of $X$. Therefore,

$$\mathrm{Var}(X) = \mathrm{E}[X^2] - 2\bar{X}^2 + \bar{X}^2 = \mathrm{E}[X^2] - \bar{X}^2$$

Hence $\mathrm{Var}(\epsilon) = \mathrm{E}[\epsilon^2] - \bar{\epsilon}^2$. But the mean of $\epsilon$ is zero, so

$$\mathrm{Var}(\epsilon) = \mathrm{E}[\epsilon^2] \tag{2}$$

Now combining equation (1), whose last term is zero, with equation (2):

$$\mathrm{E}[(Y-\hat{Y})^2] = \mathrm{E}[(f(X)-\hat{f}(X))^2] + \mathrm{E}[\epsilon^2] = \mathrm{E}[(f(X)-\hat{f}(X))^2] + \mathrm{Var}(\epsilon)$$

And since $\hat{f}$ and $X$ are held fixed, $f(X)-\hat{f}(X)$ is a constant, so $\mathrm{E}[(f(X)-\hat{f}(X))^2] = [f(X)-\hat{f}(X)]^2$, which is the book's decomposition.
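The same algebra can be checked symbolically; the sketch below uses SymPy and assumes a Normal error only so that the moments can be computed (all the derivation really uses is $\mathrm{E}[\epsilon]=0$ and $\mathrm{Var}(\epsilon)=\sigma^2$):

```python
from sympy import symbols, simplify
from sympy.stats import E, Normal

# d stands for the fixed gap f(X) - f_hat(X); eps is the noise term.
d = symbols("d", real=True)
sigma = symbols("sigma", positive=True)
eps = Normal("eps", 0, sigma)   # E[eps] = 0, Var(eps) = sigma**2

mse = E((d + eps) ** 2)         # E[(Y - Yhat)^2] with Y - Yhat = d + eps
print(simplify(mse))            # d**2 + sigma**2
```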

Mooncrater