
My stats professor assigned this problem: "Show that the expected prediction error (EPE) under squared error loss, when $Y=f(X)+\varepsilon$ with estimator $\hat{f}(x)$, assuming $X=x$ is fixed and $\varepsilon \sim (0,\sigma^2)$, can be written as a combination of the bias and the variance. In other words, show that

$$EPE(x) = E[(Y-\hat{f}(x))^2] = \sigma^2 + \text{Bias}^2 + \text{Var}(\hat{f}(x)).$$

I've come up with 3 ways to derive this relationship, but all of them depend on the assumption that $Y=f(X)+\varepsilon$ and $\hat{f}(X)$ are independent, or at least that their covariance is zero. For example, this allows me to use $\text{Var}[Y-\hat{f}(X)] = \text{Var}(Y)+\text{Var}(\hat{f}(X))$. This assumption makes intuitive sense to me because there is no necessary connection between $Y$ and the estimated value $\hat{f}(X)$ (the estimator could be a random number generator, after all). But I'm struggling to justify the independence assumption in a rigorous way. Can somebody nudge me toward understanding?

kjetil b halvorsen
gasbag_1

1 Answer


Here is a hint: consider $Y - \hat f = (Y - f) + (f - \hat f)$, and remember that $E(Y-f)=0$ and that $f$ is not random. Also, as @GeoMatt22 pointed out, you'll need $Cov(\varepsilon_0, \hat f) = 0$, which we get by virtue of iid errors.

(Basically, I think you're making this more complicated than it needs to be; it really just boils down to my hint.)
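In case it helps to see where the hint leads, here is a sketch of the expansion at a single fixed point, abbreviating $f = f(x_0)$ and $\hat f = \hat f(x_0)$ (the abbreviations are just for readability), and using only $E(\varepsilon_0)=0$, $\text{Var}(\varepsilon_0)=\sigma^2$, and $\text{Cov}(\varepsilon_0, \hat f)=0$:

$$
\begin{aligned}
E\big[(Y - \hat f)^2\big]
&= E\big[(\varepsilon_0 + f - \hat f)^2\big] \\
&= E(\varepsilon_0^2) + 2\,E\big[\varepsilon_0 (f - \hat f)\big] + E\big[(f - \hat f)^2\big] \\
&= \sigma^2 + 0 + \big(f - E\hat f\big)^2 + \text{Var}(\hat f),
\end{aligned}
$$

where the cross term vanishes because $f$ is a constant and $\text{Cov}(\varepsilon_0, \hat f)=0$, and the last step uses $E\big[(f - \hat f)^2\big] = (f - E\hat f)^2 + \text{Var}(\hat f)$.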

Regarding whether or not $\hat f \perp Y$: generally our predictions are functions not just of $X$ but also of $Y$, so they can't be independent. In linear regression, for example, the fitted values are $\hat Y = X(X^T X)^{-1}X^T Y$, so it is certainly not the case that $\hat Y \perp Y$ in general.
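As a concrete check of that dependence (treating $X$ as fixed and assuming the usual homoskedastic setup $\text{Var}(Y) = \sigma^2 I$, which is an extra assumption on my part), write $H = X(X^TX)^{-1}X^T$; then

$$\text{Cov}(\hat Y, Y) = \text{Cov}(HY, Y) = H\,\text{Cov}(Y, Y) = \sigma^2 H,$$

so $\text{Cov}(\hat y_i, y_i) = \sigma^2 H_{ii}$, which is strictly positive at any point with nonzero leverage: the fitted values and the responses are correlated, not independent.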

Update

I think the issue is that we've both been a little careless about what '$\varepsilon$' refers to. We observe training data $(\mathbf{y}, \mathbf{X})$ with $y_i = f(x_i) + \varepsilon_i$, so $\hat f$ is a function of $\mathbf{y}$ and $\mathbf{X}$, and therefore of $\varepsilon_1, \dots, \varepsilon_n$. We then observe a new point $(y_0, x_0)$, where we assume $y_0 = f(x_0) + \varepsilon_0$. This is the key: the new point has its own error $\varepsilon_0$, which is independent of everything that went into $\hat f$ by the usual assumption of iid errors. So for $i = 1, \dots, n$ it is definitely not the case that $\varepsilon_i \perp \hat f$; but the error at a new point is indeed uncorrelated with $\hat f$.
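If a quick simulation helps make this concrete, here is a rough sketch (the true $f$, the design, the sample size, and the choice of OLS as the estimator are all just illustrative choices on my part). It estimates the correlation between a training error $\varepsilon_i$ and the corresponding fitted value, and between a new point's error $\varepsilon_0$ and the prediction $\hat f(x_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 + 3.0 * x          # illustrative "true" regression function (made up)

n, sigma, n_sims = 30, 1.0, 5000
x_train = np.linspace(0, 1, n)    # fixed design
x0 = 0.5                          # the new point

eps_tr, fhat_tr, eps_new, fhat_new = [], [], [], []
for _ in range(n_sims):
    eps = rng.normal(0, sigma, n)                  # training errors
    y = f(x_train) + eps
    X = np.column_stack([np.ones(n), x_train])     # design matrix (1, x)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS fit
    fhat = X @ beta                                # fitted values at training points
    fhat0 = beta[0] + beta[1] * x0                 # prediction at the new point
    eps0 = rng.normal(0, sigma)                    # the new point's own error
    eps_tr.append(eps[0]); fhat_tr.append(fhat[0])
    eps_new.append(eps0); fhat_new.append(fhat0)

# Correlation of a training error with its fitted value: clearly nonzero.
print(np.corrcoef(eps_tr, fhat_tr)[0, 1])
# Correlation of the new point's error with the prediction there: ~0.
print(np.corrcoef(eps_new, fhat_new)[0, 1])
```

The first correlation comes out clearly positive, while the second hovers around zero, which is exactly the distinction drawn above.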

jld
  • I think the relevant assumption is that $\epsilon$ and $\hat{f}$ have zero covariance? – GeoMatt22 Sep 09 '16 at 21:40
  • @GeoMatt22 good point. I've updated. – jld Sep 09 '16 at 21:50
  • Thanks. In one of my derivations I have $V(Y-\hat{f})=V(Y-f+f-\hat{f})=V(\varepsilon+f-\hat{f})=V(\varepsilon)+V(f)+V(\hat{f})=\sigma^2+0+V(\hat{f})$, which is what I want. But it seems that this argument rests on $\varepsilon$, $f$ and $\hat{f}$ having zero pairwise covariance, which gets back to my original problem. Am I missing something? – gasbag_1 Sep 09 '16 at 21:51
  • $f$ is constant and you can show that a constant is uncorrelated with anything (proof: $Cov(X, a) = E(aX) - E(a)E(X) = a(EX-EX)=0$), so you really just need that $\varepsilon$ and $\hat f$ are uncorrelated – jld Sep 09 '16 at 21:53
  • Okay, that makes sense. So $\text{Cov}(\varepsilon,\hat{f}) = E(\varepsilon \hat{f})-E(\varepsilon)E(\hat{f}) = E(\varepsilon \hat{f})$. At this point I'm stuck. If $\hat{f}$ is a function of $Y$ and $Y=f(X)+\varepsilon$, then $\hat{f}$ and $\varepsilon$ are not independent. I would like to say that $E(\varepsilon \hat{f}) = \hat{f}E(\varepsilon) = 0$, but I don't see how to justify that move. Or must we simply assume that $\varepsilon$ and $\hat{f}$ are uncorrelated? – gasbag_1 Sep 09 '16 at 23:12
  • @Lawrence303 I've made an update that i think answers your question – jld Sep 10 '16 at 00:01