33

I am reading the chapter on the bias-variance tradeoff in The Elements of Statistical Learning and I don't understand the formula on page 29. Let the data arise from a model $$ Y = f(x)+\varepsilon$$ where $\varepsilon$ is a random variable with expected value $E[\varepsilon]=0$ and variance $E[(\varepsilon - E[\varepsilon])^2]=E[\varepsilon^2]=\sigma^2$. Let the expected squared error of the model be $$ E[(Y-f_k(x))^2] $$ where $f_k(x)$ is the prediction of our learner $k$ at $x$. According to the book, the error is $$ \newcommand{\Bias}{\rm Bias} \newcommand{\Var}{\rm Var} E[(Y-f_k(x))^2]=\sigma^2+\Bias(f_k)^2+\Var(f_k(x)). $$

My question is: Why is the bias term not $0$? Developing the formula for the error I see: \begin{align} E[(Y-f_k(x))^2] &= E[(f(x)+\varepsilon-f_k(x))^2] \\ &= E[(f(x)-f_k(x))^2] + 2E[(f(x)-f_k(x))\varepsilon] + E[\varepsilon^2] \\ &= \Var(f_k(x))+2E[(f(x)-f_k(x))\varepsilon]+\sigma^2 \end{align}

As $\varepsilon$ is an independent random variable, $2E[(f(x)-f_k(x))\varepsilon] = 2E[f(x)-f_k(x)]\,E[\varepsilon]=0$.

Where am I wrong?

emanuele
  • I just want to make a comment here regarding the formula in the book, which in my opinion is very problematic mathematically speaking: $Err(x_0)=E[(Y-\hat{f}(x_0))^2 |X=x_0]$. The author refers (I guess he means to, at least) to $x_0$ as a random point, but the notation $X=x_0$ is only for a deterministic value $x_0$ of the random variable $X$. In my opinion the only thing we can define here is $Err$, without the dependency on $x_0$ (which I consider nonsense notation), simply as $Err = E[(Y-\hat{f}(X))^2]$, where $\hat{f}$ is the calibrated model. – noob-mathematician Mar 22 '20 at 17:41

2 Answers

31

A few more steps of the bias-variance decomposition

Indeed, the full derivation is rarely given in textbooks, as it involves a lot of uninspiring algebra. Here is a more complete derivation, using the notation of The Elements of Statistical Learning, page 223.


If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $Var(\epsilon) = \sigma^2_\epsilon$, then we can derive the expression for the expected prediction error of a regression fit $\hat f(X)$ at an input $X = x_0$, using squared-error loss:

$$Err(x_0) = E[ (Y - \hat f(x_0) )^2 | X = x_0]$$

For notational simplicity let $\hat f(x_0) = \hat f$ and $f(x_0) = f$, and recall that $E[f] = f$ (it is deterministic) and $E[Y] = f$.

\begin{aligned} E[ (Y - \hat f)^2 ] &= E[(Y - f + f - \hat f )^2] \\ & = E[(Y - f)^2] + E[(f - \hat f)^2] + 2 E[(f - \hat f)(Y - f)] \\ & = E[(f + \epsilon - f)^2] + E[(f - \hat f)^2] + 2E[fY - f^2 - \hat f Y + \hat f f] \\ & = E[\epsilon^2] + E[(f - \hat f)^2] + 2( f^2 - f^2 - f E[\hat f] + f E[\hat f] ) \\ & = \sigma^2_\epsilon + E[(f - \hat f)^2] + 0 \end{aligned}

For the term $E[(f - \hat f)^2]$ we can use a similar trick as above, adding and subtracting $E[\hat f]$ to get

\begin{aligned} E[(f - \hat f)^2] & = E[(f - E[\hat f] + E[\hat f] - \hat f)^2] \\ & = E\left[ \left(f - E[\hat f]\right)^2 \right] + E\left[ \left(\hat f - E[\hat f]\right)^2 \right] + 2E\left[ \left(f - E[\hat f]\right)\left(E[\hat f] - \hat f\right) \right] \\ & = \left(f - E[\hat f]\right)^2 + E\left[ \left(\hat f - E[\hat f]\right)^2 \right] + 2\left(f - E[\hat f]\right)\underbrace{E\left[E[\hat f] - \hat f\right]}_{=0} \\ & = Bias^2[\hat f] + Var[\hat f] \end{aligned}

Putting it together

$$E[ (Y - \hat f)^2 ] = \sigma^2_\epsilon + Bias^2[\hat f] + Var[\hat f] $$
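A quick sanity check of this result is a small Monte Carlo simulation: repeatedly redraw the training set, refit the model, and estimate each term at a fixed test point. The sketch below is my own illustration, not from the book; the choice $f = \sin$, the (deliberately misspecified) linear fit, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin              # assumed true regression function
sigma = 0.3             # noise standard deviation
x0 = 2.0                # fixed test point
n_train, n_reps = 50, 20_000

preds = np.empty(n_reps)   # \hat f(x0) across training sets
sq_err = np.empty(n_reps)  # (Y - \hat f(x0))^2 across training sets
for i in range(n_reps):
    x = rng.uniform(0.0, 3.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    b, a = np.polyfit(x, y, 1)            # linear fit: y ≈ a + b x
    preds[i] = a + b * x0
    y0 = f(x0) + rng.normal(0.0, sigma)   # fresh observation Y at x0
    sq_err[i] = (y0 - preds[i]) ** 2

bias2 = (f(x0) - preds.mean()) ** 2
var = preds.var()
print(f"E[(Y - fhat)^2]        ≈ {sq_err.mean():.4f}")
print(f"sigma^2 + Bias^2 + Var ≈ {sigma**2 + bias2 + var:.4f}")
```

The two printed numbers agree up to Monte Carlo error, and shrinking the model class (say, fitting a constant instead of a line) shifts error from the variance term into the bias term.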


Some comments on why $E[\hat f Y] = f E[\hat f]$

Taken from an answer by Alecos Papadopoulos.

Recall that $\hat f$ is the predictor constructed from the $m$ data points $\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)}) \}$, so we can write $\hat f = \hat f_m$ as a reminder.

On the other hand, $Y = y^{(m+1)}$ is the observed response at a new data point $(x^{(m+1)},y^{(m+1)})$, which we predict using the model constructed on the $m$ data points above. So the mean squared error can be written as

$$ E\left[\left(\hat f_m(x^{(m+1)}) - y^{(m+1)}\right)^2\right]$$

Expanding the equation from the previous section

$$E[\hat f_m Y]=E[\hat f_m (f+ \epsilon)]=E[\hat f_m f+\hat f_m \epsilon]=E[\hat f_m f]+E[\hat f_m \epsilon]$$

The last part of the equation can be viewed as

$$ E[\hat f_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}] = 0$$

Since we make the following assumptions about the point $x^{(m+1)}$ (illustrated by the simulation sketch after this list):

  • It was not used when constructing $\hat f_m$
  • It is independent of all other observations $\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)}) \}$
  • It is independent of $\epsilon^{(m+1)}$
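To see why the out-of-sample assumption does the work here, the following small simulation (my own construction, with an assumed $f = \sin$ and a linear fit) contrasts $E[\hat f_m \epsilon]$ in sample and out of sample: the training noise that helped build $\hat f_m$ is correlated with its fitted values, while the noise at a fresh point is not.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin
sigma = 0.5
n_train, n_reps = 30, 50_000

in_prod = np.empty(n_reps)
out_prod = np.empty(n_reps)
for i in range(n_reps):
    x = rng.uniform(0.0, 3.0, n_train)
    eps = rng.normal(0.0, sigma, n_train)
    b, a = np.polyfit(x, f(x) + eps, 1)
    # In sample: eps^(1) was used to construct fhat_m, so the product
    # has nonzero mean (the fit is pulled toward its own noise).
    in_prod[i] = (a + b * x[0]) * eps[0]
    # Out of sample: eps^(m+1) is independent of fhat_m, so the mean is ~0.
    x_new = rng.uniform(0.0, 3.0)
    eps_new = rng.normal(0.0, sigma)
    out_prod[i] = (a + b * x_new) * eps_new

print(f"in-sample  E[fhat * eps] ≈ {in_prod.mean():+.4f}")   # small but clearly > 0
print(f"out-sample E[fhat * eps] ≈ {out_prod.mean():+.4f}")  # ≈ 0
```

This is exactly the distinction raised in the comments below: $\hat f$ is not independent of the noise in its own training set, but it is independent of the noise at a genuinely new point.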

Other sources with full derivations

Xavier Bourret Sicotte
    Why $E[\hat{f}Y]=f E[\hat{f}]$? I don't think $Y$ and $\hat{f}$ are independent, since $\hat{f}$ is essentially constructed using $Y$. – Felipe Pérez Sep 10 '18 at 11:53
    But the question is essentially the same, why $E[\hat{f}\epsilon]=0$? The randomness of $\hat{f}$ comes from the error $\epsilon$ so I don't see why would $\hat{f}$ and $\epsilon$ be independent, and hence, $\mathbb{E}(\hat{f}\epsilon)=0$. – Felipe Pérez Sep 10 '18 at 12:20
  • From your clarification it seems that the in-sample vs. out-of-sample perspective is crucial. Is that so? If we work only in sample, and then view $\epsilon$ as a residual, does the bias-variance tradeoff disappear? – markowitz May 15 '19 at 07:58
    @FelipePérez as far as I understand, the randomness of $\hat{f}$ comes from the train-test split (which points ended up in the training set and gave $\hat{f}$ as the trained predictor). In other words, the variance of $\hat{f}$ comes from all the possible subsets of a given fixed data-set that we can take as the training set. Because the data-set is fixed, there is no randomness coming from $\epsilon$ and therefore $\hat{f}$ and $\epsilon$ are independent. – Alberto Santini Jul 09 '19 at 23:17
28

You are not wrong, but you made an error in one step: $E[(f(x)-f_k(x))^2] \ne Var(f_k(x))$. Rather, $E[(f(x)-f_k(x))^2]$ is $\text{MSE}(f_k(x)) = Var(f_k(x)) + \text{Bias}^2(f_k(x))$.

\begin{align*} E[(Y-f_k(x))^2]& = E[(f(x)+\epsilon-f_k(x))^2] \\ &= E[(f(x)-f_k(x))^2]+2E[(f(x)-f_k(x))\epsilon]+E[\epsilon^2]\\ &= E\left[\left(f(x) - E(f_k(x)) + E(f_k(x))-f_k(x) \right)^2 \right] + 2E[(f(x)-f_k(x))\epsilon]+\sigma^2 \\ & = Var(f_k(x)) + \text{Bias}^2(f_k(x)) + \sigma^2. \end{align*}

Note: $E[(f_k(x)-E(f_k(x)))(f(x)-E(f_k(x)))] = (f(x)-E(f_k(x)))\,E[f_k(x)-E(f_k(x))] = 0$, since $f(x)-E(f_k(x))$ is deterministic and $E[f_k(x)-E(f_k(x))] = 0$.
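To spell out the step asked about in the comments (this expansion is mine, using the note above; write $f = f(x)$ and $f_k = f_k(x)$ for brevity):

\begin{align*} E\left[\left(f - E(f_k) + E(f_k)-f_k \right)^2 \right] &= E\left[\left(f - E(f_k)\right)^2\right] + E\left[\left(E(f_k)-f_k\right)^2\right] + 2E\left[\left(f - E(f_k)\right)\left(E(f_k)-f_k\right)\right] \\ &= \left(f - E(f_k)\right)^2 + Var(f_k) + 0 \\ &= \text{Bias}^2(f_k) + Var(f_k), \end{align*}

where the cross term vanishes by the note, and the remaining term $2E[(f-f_k)\epsilon]$ in the derivation above vanishes because $\epsilon$ is independent of $f_k(x)$ at a new point (see the discussion of $E[\hat f \epsilon]$ in the other answer).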

Greenparker
  • In the case of binary outcomes, is there an equivalent proof with cross-entropy as the error measure? – emanuele Mar 28 '16 at 15:22
    It doesn't work out quite so well with a binary response. See Ex 7.2 in the second edition of "The Elements of Statistical Learning". – Matthew Drury Mar 28 '16 at 21:25
    could you explain how you go from $E\left[\left(f(x) - E(f_k(x)) + E(f_k(x))-f_k(x) \right)^2 \right] + 2E[(f(x)-f_k(x))\epsilon]+\sigma^2$ to $Var(f_k(x)) + \text{Bias}^2(f_k(x)) + \sigma^2$? – Antoine Apr 30 '18 at 14:31