
We know that the least squares estimate of $\beta$ for linear regression is $\hat{\beta} = (X^TX)^{-1}X^Ty$, and I can derive this equation by minimizing the squared error over the training sample.
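For concreteness, here is a quick numerical sketch (simulated data, using numpy) just to confirm the training-sample formula:

```python
# Check beta_hat = (X^T X)^{-1} X^T y on simulated data (made-up example).
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))        # True
```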

But when we minimize the expected prediction error, we get: $$ \begin{aligned} \frac{\partial EPE(\beta)}{\partial \beta} = &\iint\frac{\partial}{\partial \beta}(x^T\beta - y)^2dxdy = 0\\ \iff &\iint 2(x^T\beta - y)x\, dxdy = 0 \\ \iff &2\iint x^T\beta x - yx\, dxdy = 0 \\ \end{aligned} $$

Now $x^T\beta$ is a scalar, so $x^T\beta x = x(x^T\beta)$. So: $$ \begin{aligned} \iff &\iint x(x^T\beta) - yx\, dxdy = 0 \\ \iff &E(XX^T\beta) - E(YX) = 0 \\ \iff &\beta = (E(XX^T))^{-1}E(YX) \end{aligned} $$
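To make the population version concrete, here is how I would approximate these expectations by Monte Carlo, assuming some made-up joint distribution for $(x, y)$:

```python
# Estimate beta = (E[x x^T])^{-1} E[y x] by simulation from an assumed joint
# distribution: x ~ N(0, I), y = x^T beta + noise.
import numpy as np

rng = np.random.default_rng(1)
M, p = 100_000, 3
beta_true = np.array([1.0, -2.0, 0.5])

x = rng.normal(size=(M, p))                           # draws of the random vector x
y = x @ beta_true + rng.normal(size=M)                # y = x^T beta + noise

Exx = (x[:, :, None] * x[:, None, :]).mean(axis=0)    # estimate of E[x x^T], a p x p matrix
Eyx = (y[:, None] * x).mean(axis=0)                   # estimate of E[y x], a p-vector
print(np.linalg.solve(Exx, Eyx))                      # close to beta_true
```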

I don't know how this expectation leads to $\hat{\beta}$. It doesn't seem to add up, since $XX^T$ is an $N \times N$ matrix. What am I missing?

Thanks!


Edit: since it was pointed out that if $X$ is a random variable the expectation has to be conditioned on it, we have:

$$ \begin{aligned} \frac{\partial EPE(\beta)}{\partial \beta} = &\iint\frac{\partial}{\partial \beta}(x^T\beta - y)^2 \Pr(x,y)\, dx\, dy = 0\\ \iff &\iint 2(x^T\beta - y)x \Pr(x,y)\, dx\, dy = 0 \\ \iff &2\iint [x^T\beta x - yx] \Pr(y|x) \Pr(x)\, dx\, dy = 0 \\ \iff &E_XE_{Y|X} [X(X^T\beta) - YX] = 0 \\ \iff &E_X(XX^T\beta) - E_XE_{Y|X}(YX) = 0 \\ \iff &\beta = (E_X(XX^T))^{-1}E_XE_{Y|X}(YX) \end{aligned} $$
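By linearity, and since $X$ is held fixed inside the inner expectation, the last factor can also be written as

$$ E_XE_{Y|X}(YX) = E_X\big(X\, E(Y|X)\big). $$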

Victor
    $X$ is not random: either it is viewed as constant or else all expectations are conditional on it. – whuber Dec 28 '19 at 20:07
  • @whuber I added the solution with the conditional expectation. Does that make sense? I am still unsure how to get to $\hat{\beta}$ from there – Victor Dec 28 '19 at 23:08
  • You haven't done that consistently: for instance, you still take an unconditional expectation with $E_X(XX^\prime).$ Note, too, that linearity of expectation implies $E_{Y\mid X}(YX) = E(Y\mid X)X.$ – whuber Dec 29 '19 at 20:42
  • I was thinking and I don't believe we need to condition on $X$. We can solve it with the expectation over the joint distribution $\Pr(x,y)$. And then we get $\beta = (E(XX^T))^{-1}E(YX)$. My problem is relating this formula to $\hat{\beta} = (X^TX)^{-1}X^Ty$. Or how come $E(XX^T) = X^TX$ and $E(YX) = X^Ty$. Again these would be expectations over the joint distribution. – Victor Dec 29 '19 at 23:30
  • Not conditioning on $X$ would result in values that are useless in most applications: think about how little the result would tell you about predicting or estimating the response for any particular set of regressor values. – whuber Dec 30 '19 at 14:34

1 Answer


Indeed, $X$ is not random, as @whuber mentioned in the comments. Recall the formulation of the linear regression problem:

$$ \begin{aligned} y_j =& \beta_1 x_{j1} + ... + \beta_n x_{jn} + \epsilon_j\\ \mathbf{y} =& X \boldsymbol{\beta} + \boldsymbol{\epsilon} \end{aligned} $$

where $\mathbb{E}\epsilon_j=0$. We are given the $x_{ji}$ and want to estimate the coefficients $\beta_i$. Note that the $x_{ji}$ are not random. Suppose we are trying to determine the properties of a device: we pass in the inputs $x_{ji}$ (which we know precisely) and measure the output $y_j$ with some error $\epsilon_j$. So by the definition of the problem:

$$ \mathbb{E}y_j = \beta_1 x_{j1} + ... + \beta_n x_{jn} $$

Hence

$$ \begin{aligned} \mathbb{E}(X^TX) = X^TX\\ \mathbb{E}(X^T\mathbf{y}) = X^T\mathbf{y}\\ \end{aligned} $$

The complete derivation may then look like this:

$$ \begin{aligned} &\iint\frac{\partial}{\partial \hat{\boldsymbol{\beta}}}\lVert X\hat{\boldsymbol{\beta}} - \mathbf{y}\rVert^2\, dxdy = 0\\ \iff &2\iint X^T(X\hat{\boldsymbol{\beta}} - \mathbf{y})\, dxdy = 0 \\ \iff &2\iint X^T X\hat{\boldsymbol{\beta}} - X^T\mathbf{y}\, dxdy = 0 \\ \iff &E(X^T X\hat{\boldsymbol{\beta}}) - E(X^T\mathbf{y}) = 0 \\ \iff &X^T X\hat{\boldsymbol{\beta}}- X^T\mathbf{y} = 0 \\ \iff &\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T\mathbf{y} \\ \end{aligned} $$
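As a sanity check, here is a small simulation of the "device" picture: the design matrix is fixed, only the measurement noise is random (the numbers are made up):

```python
# Fixed design X ("inputs we know precisely"), noisy measurements y = X beta + eps;
# the OLS formula recovers beta up to measurement noise.
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), np.linspace(0.0, 10.0, N)])   # fixed, non-random design
beta = np.array([3.0, -0.7])

y = X @ beta + rng.normal(scale=0.5, size=N)    # measurement error with E[eps] = 0
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
print(beta_hat)                                 # approximately [3.0, -0.7]
```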

By the way, I am not sure that it actually makes sense to write integrals and take expectations here, because we are dealing with constants. It would probably be better to stick with something like this:

$$ \begin{aligned} \ &(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \rightarrow \min, \qquad \hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}\\ \Rightarrow\ &(\mathbf{y} - \hat{\mathbf{y}})^TX = 0\qquad\text{residual}\perp\text{column space of }X\\ \Rightarrow\ &\mathbf{y}^TX - \hat{\boldsymbol{\beta}}^TX^TX = 0\\ \Rightarrow\ &\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T\mathbf{y} \end{aligned} $$
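The orthogonality condition is also easy to verify numerically (again with simulated data):

```python
# The residual y - X beta_hat is orthogonal to every column of X (up to rounding).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta_hat
print(X.T @ residual)                           # numerically ~ 0
```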


Edit: about taking the expectation, it seems that you are right, but there is probably a small mistake connected with the vector derivative. It should be:

$$ \begin{aligned} &2\iint (X^T X\hat{\boldsymbol{\beta}} - X^T\mathbf{y})\mathbb{P}(y|x)\mathbb{P}(x) dxdy = 0 \\ \iff &\mathbb{E}_X(X^TX\hat{\boldsymbol{\beta}}) - \mathbb{E}_X\mathbb{E}_{Y|X}(X^T\mathbf{y})=0 \end{aligned} $$

So

$$ \begin{aligned} \big[\mathbb{E}(X^TX\hat{\boldsymbol{\beta}})\big]_i =& \mathbb{E}\sum_{j,k} x_{ji}x_{jk} \hat{\beta}_k\\ =& \sum_{j,k} \mathbb{E}\big(x_{ji}x_{jk}\big) \hat{\beta}_k\\ =& \sum_{k} \hat{\beta}_k \sum_{j} \mathbb{E}\big(x_{ji}x_{jk}\big)\\ =& \big[(\mathbb{E}\, X^TX)\hat{\boldsymbol{\beta}}\big]_i \end{aligned} $$

That is why

$$ \begin{aligned} &\mathbb{E}_X(X^TX\hat{\boldsymbol{\beta}}) - \mathbb{E}_X\mathbb{E}_{Y|X} (X^T\mathbf{y}) = 0\\ \iff& \hat{\boldsymbol{\beta}} = (\mathbb{E}_X(X^TX))^{-1}\mathbb{E}_X\mathbb{E}_{Y|X}(X^T\mathbf{y}) \end{aligned} $$
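A quick numerical illustration of the linearity step above, pulling a fixed vector out of the expectation over a random $X$:

```python
# The mean over draws of (X^T X) b equals (mean of X^T X) b for a fixed vector b.
import numpy as np

rng = np.random.default_rng(4)
b = np.array([1.0, -1.0, 2.0])
draws = [rng.normal(size=(50, 3)) for _ in range(1000)]

lhs = np.mean([X.T @ X @ b for X in draws], axis=0)
rhs = np.mean([X.T @ X for X in draws], axis=0) @ b
print(np.allclose(lhs, rhs))                    # True
```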

krashkov
  • Yes, I understand how to get to the $\hat{\beta}$ given a training sample and trying to minimize the squared error. That makes sense to me. What I was trying to do is calculate the EPE for $X, Y$ random variables and $f$ a linear function with shape $f(x) = x^T\beta$. But like @whuber said, this expectation would have to be conditioned on $X$. But now I am not so sure if it makes sense to do that, since the idea of using a linear function presumes that we have some knowledge about the data. – Victor Dec 28 '19 at 22:33
  • Since you are interested in stuff like that I can recommend reading this question https://stats.stackexchange.com/questions/204115/understanding-bias-variance-tradeoff-derivation?rq=1 and watching this video https://www.youtube.com/watch?v=zrEyxfl2-a8. Both might help you to clarify some concepts. – Jesper for President Dec 28 '19 at 22:56
  • I am inclined to believe that one needs some knowledge about the data to apply this method. Even if you decide to use nonlinear regression $y_j = f(z_j) + \epsilon_j$ (which in fact reduces to the linear case), you still need to make initial assumptions about $f$. – krashkov Dec 28 '19 at 23:03