I have been trying to understand the proof of the bias/variance decomposition formula, and I came across a gap that I haven't been able to fill. I will use the notation of The Elements of Statistical Learning:
Suppose our data follows a model of the form $y = f(x)+\epsilon$, where $f$ is a deterministic function (say, from $\mathbb{R}^d$ to $\mathbb{R}$), and $\epsilon$ is a random variable with zero mean and finite variance $\sigma^2$. We formulate our prediction model as $\hat{y}=\hat{f}(x)$, where $\hat{f}$ is a function that depends on the observed data. So, say we make $m$ measurements, $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$, and construct our predictor $\hat{f}$ from this data (for instance, with OLS we can write $\hat{f}$ explicitly in terms of the dataset). Now, let's talk about bias and variance. Define the bias of $\hat{f}$ by
$$ \text{bias}(\hat{f}(x))=\mathbb{E}(\hat f(x)-f(x)). $$
Here $f$ is not random, so it doesn't really matter if we put it inside the expectation. Now similarly we define the variance by
$$ \operatorname{var}(\hat f (x)) = \mathbb{E}(\hat f(x)^2) - (\mathbb{E}(\hat f(x)))^2. $$
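To make concrete what I understand these expectations to mean, here is a small simulation sketch: the expectation is taken over draws of the training set, so I repeatedly generate fresh datasets, refit OLS, and average the predictions at a fixed query point. (The particular $f$, $\sigma$, sample sizes, and query point below are arbitrary choices of mine, purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from any reference): f(x) = 2x + 1, sigma = 0.5
f = lambda x: 2.0 * x + 1.0
sigma = 0.5
m = 20            # training-set size
x0 = 0.7          # fixed query point at which we study fhat(x0)
n_datasets = 10000

preds = np.empty(n_datasets)
for k in range(n_datasets):
    # Each draw of the training set D gives one realisation of fhat
    x = rng.uniform(0.0, 1.0, m)
    y = f(x) + sigma * rng.standard_normal(m)
    # OLS with intercept: fhat(x0) = b0 + b1 * x0
    X = np.column_stack([np.ones(m), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[k] = b[0] + b[1] * x0

bias = preds.mean() - f(x0)   # Monte Carlo estimate of E[fhat(x0)] - f(x0)
var = preds.var()             # Monte Carlo estimate of var(fhat(x0))
print(bias, var)
```

Since the linear model is correctly specified here, the bias estimate comes out near zero, while the variance stays strictly positive; this is the picture of "expectation over training sets" I am trying to pin down formally.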
When trying to prove that the mean squared error decomposes as the irreducible error coming from the variance of $\epsilon$, plus the variance of $\hat f$, plus the square of the bias of $\hat f$, there is a step that is usually not strongly justified: most sources I've seen say something along the lines of "$\mathbb{E}(\epsilon \hat f)=0$, since $\epsilon$ is independent of $\hat f$". This is not clear to me, since $\hat f$ is constructed from the observed data, which itself carries noise from $\epsilon$. I've tried performing the calculations more formally, keeping careful track of which expectations are taken with respect to what, but I got lost. How do you actually compute something like the bias of $\hat f$? More precisely, what is the space of parameters which you integrate over?
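For what it's worth, the decomposition does seem to check out numerically when the test point's noise is drawn independently of the training set, which is exactly the independence assumption I'm asking about. Here is a sanity check one can run (again Python; I deliberately fit a misspecified straight line to a sinusoidal $f$ so the bias term is nonzero, and all names and values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative true model: f(x) = sin(2*pi*x), noise sd sigma = 0.3
f = lambda x: np.sin(2.0 * np.pi * x)
sigma = 0.3
m, x0, n_datasets = 30, 0.25, 20000

preds = np.empty(n_datasets)
sq_err = np.empty(n_datasets)
for k in range(n_datasets):
    x = rng.uniform(0.0, 1.0, m)
    y = f(x) + sigma * rng.standard_normal(m)
    # Fit a straight line (misspecified on purpose, so bias != 0)
    X = np.column_stack([np.ones(m), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[k] = b[0] + b[1] * x0
    # Fresh test noise, drawn independently of the training set
    y0 = f(x0) + sigma * rng.standard_normal()
    sq_err[k] = (y0 - preds[k]) ** 2

mse = sq_err.mean()                   # E[(y0 - fhat(x0))^2]
bias2 = (preds.mean() - f(x0)) ** 2   # bias^2
var = preds.var()                     # var(fhat(x0))
# The two numbers below should agree up to Monte Carlo error
print(mse, sigma**2 + bias2 + var)
```

The agreement here relies on the test noise being a fresh draw, independent of everything used to build $\hat f$; my confusion is precisely about whether, and why, the same argument applies to the noise terms inside the training data.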
Thanks in advance!