We know that the least squares estimate of $\beta$ for linear regression is $\hat{\beta} = (X^TX)^{-1}X^Ty$, and I can derive this equation by minimizing the squared error on the training sample.
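(For reference, the training-sample derivation I have in mind is the standard normal-equations argument:)

$$ \begin{aligned} RSS(\beta) &= (y - X\beta)^T(y - X\beta) \\ \frac{\partial RSS(\beta)}{\partial \beta} &= -2X^T(y - X\beta) = 0 \\ \Rightarrow \hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned} $$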
But when we minimize the expected prediction error, we get: $$ \begin{aligned} \frac{\partial EPE(\beta)}{\partial \beta} = &\iint\frac{\partial}{\partial \beta}(x^T\beta - y)^2\, dx\, dy = 0\\ \iff &\iint 2(x^T\beta - y)x\, dx\, dy = 0 \\ \iff &2\iint [x^T\beta\, x - yx]\, dx\, dy = 0 \\ \end{aligned} $$
Now $x^T\beta$ is a scalar, so $x^T\beta\, x = x(x^T\beta)$, and since $\beta$ is a constant it can be pulled out of the expectation. So: $$ \begin{aligned} \iff &\iint [x(x^T\beta) - yx]\, dx\, dy = 0 \\ \iff &E(XX^T)\beta - E(YX) = 0 \\ \iff &\beta = (E(XX^T))^{-1}E(YX) \end{aligned} $$
I don't know how this expectation leads to $\hat{\beta}$. It doesn't seem to add up, since $XX^T$ is an $N \times N$ matrix. What am I missing?
Thanks!
Edit: since it was pointed out that, if $X$ is a random variable, the expectation has to be conditioned on $X$, we have:
$$ \begin{aligned} \frac{\partial EPE(\beta)}{\partial \beta} = &\iint\frac{\partial}{\partial \beta}(x^T\beta - y)^2 \Pr(x,y)\, dx\, dy = 0\\ \iff &\iint 2(x^T\beta - y)x \Pr(x,y)\, dx\, dy = 0 \\ \iff &2\iint [x^T\beta\, x - yx] \Pr(y|x) \Pr(x)\, dx\, dy = 0 \\ \iff &E_XE_{Y|X} [X(X^T\beta) - YX] = 0 \\ \iff &E_X(XX^T\beta) - E_XE_{Y|X}(YX) = 0 \\ \iff &\beta = (E_X(XX^T))^{-1}E_XE_{Y|X}(YX) \end{aligned} $$
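To make the notation concrete for myself, here is a minimal numeric sketch of how I read the final formula, treating $x$ as a random vector and approximating the expectations by Monte Carlo averages over draws from an assumed joint distribution (the particular Gaussian-plus-noise distribution below is just an illustration, not from the book). I may well be misreading the notation:

```python
import numpy as np

# Sketch of beta = (E[X X^T])^{-1} E[Y X], with the expectations
# approximated by sample averages over draws from an assumed joint
# distribution of (x, y). The distribution below is hypothetical,
# chosen only to make the example runnable.
rng = np.random.default_rng(0)
p = 3
beta_true = np.array([1.0, -2.0, 0.5])

n_draws = 200_000                               # Monte Carlo sample size
x = rng.normal(size=(n_draws, p))               # draws of the random vector x
y = x @ beta_true + rng.normal(size=n_draws)    # y | x linear plus noise

# Sample-average approximations of the two population moments:
E_xxT = (x.T @ x) / n_draws    # approximates E[X X^T]
E_yx = (x.T @ y) / n_draws     # approximates E[Y X]

beta = np.linalg.solve(E_xxT, E_yx)
print(beta)   # close to beta_true for large n_draws
```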