I have been trying to understand the proof of the bias/variance decomposition formula, and I came across a gap that I haven't been able to fill. I will use the notation of The Elements of Statistical Learning:
Suppose our data follows a model of the form $y = f(x)+\epsilon$, where $f$ is a deterministic function (say, from $\mathbb{R}^d$ to $\mathbb{R}$), and $\epsilon$ is a random variable with zero mean and finite variance $\sigma^2$. We formulate our prediction model as $\hat{y}=\hat{f}(x)$, where $\hat{f}$ is a function that depends on the observed data. So, say we make $m$ measurements, $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$, and construct our predictor $\hat{f}$ from this data (for instance, with OLS we can write $\hat{f}$ explicitly in terms of the dataset). Now, let's talk about bias and variance. Define the bias of $\hat{f}$ by
$$ \text{bias}(\hat{f}(x))=\mathbb{E}(\hat f(x)-f(x)). $$
Here $f$ is not random, so it doesn't really matter if we put it inside the expectation. Now similarly we define the variance by
$$ \operatorname{var}(\hat f (x)) = \mathbb{E}(\hat f(x)^2) - (\mathbb{E}(\hat f(x)))^2. $$
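To make concrete what I understand these expectations to mean, here is a small simulation sketch: the expectation is taken over draws of the training set, so I repeatedly generate fresh datasets, refit OLS, and average the predictions at a fixed query point. (The particular $f$, $\sigma$, sample sizes, and query point below are arbitrary choices of mine, purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from any reference): f(x) = 2x + 1, sigma = 0.5
f = lambda x: 2.0 * x + 1.0
sigma = 0.5
m = 20            # training-set size
x0 = 0.7          # fixed query point at which we study fhat(x0)
n_datasets = 10000

preds = np.empty(n_datasets)
for k in range(n_datasets):
    # Each draw of the training set D gives one realisation of fhat
    x = rng.uniform(0.0, 1.0, m)
    y = f(x) + sigma * rng.standard_normal(m)
    # OLS with intercept: fhat(x0) = b0 + b1 * x0
    X = np.column_stack([np.ones(m), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[k] = b[0] + b[1] * x0

bias = preds.mean() - f(x0)   # Monte Carlo estimate of E[fhat(x0)] - f(x0)
var = preds.var()             # Monte Carlo estimate of var(fhat(x0))
print(bias, var)
```

Since the linear model is correctly specified here, the bias estimate comes out near zero, while the variance stays strictly positive; this is the picture of "expectation over training sets" I am trying to pin down formally.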
When trying to prove that the mean squared error decomposes as the irreducible error coming from the variance of $\epsilon$, plus the variance of $\hat f$, plus the square of the bias of $\hat f$, there is a step that is usually not strongly justified: most sources I've seen say something along the lines of "$\mathbb{E}(\epsilon \hat f)=0$, since $\epsilon$ is independent of $\hat f$". This is not clear to me, since $\hat f$ is constructed from the observed data, which itself carries noise from $\epsilon$. I've tried performing the calculations more formally, keeping careful track of which expectations are taken with respect to what, but I got lost. How do you actually compute something like the bias of $\hat f$? More precisely, what is the space of parameters which you integrate over?
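For what it's worth, the decomposition does seem to check out numerically when the test point's noise is drawn independently of the training set, which is exactly the independence assumption I'm asking about. Here is a sanity check one can run (again Python; I deliberately fit a misspecified straight line to a sinusoidal $f$ so the bias term is nonzero, and all names and values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative true model: f(x) = sin(2*pi*x), noise sd sigma = 0.3
f = lambda x: np.sin(2.0 * np.pi * x)
sigma = 0.3
m, x0, n_datasets = 30, 0.25, 20000

preds = np.empty(n_datasets)
sq_err = np.empty(n_datasets)
for k in range(n_datasets):
    x = rng.uniform(0.0, 1.0, m)
    y = f(x) + sigma * rng.standard_normal(m)
    # Fit a straight line (misspecified on purpose, so bias != 0)
    X = np.column_stack([np.ones(m), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[k] = b[0] + b[1] * x0
    # Fresh test noise, drawn independently of the training set
    y0 = f(x0) + sigma * rng.standard_normal()
    sq_err[k] = (y0 - preds[k]) ** 2

mse = sq_err.mean()                   # E[(y0 - fhat(x0))^2]
bias2 = (preds.mean() - f(x0)) ** 2   # bias^2
var = preds.var()                     # var(fhat(x0))
# The two numbers below should agree up to Monte Carlo error
print(mse, sigma**2 + bias2 + var)
```

The agreement here relies on the test noise being a fresh draw, independent of everything used to build $\hat f$; my confusion is precisely about whether, and why, the same argument applies to the noise terms inside the training data.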
Thanks in advance!