
Suppose I have the following model: $$Y = \mu + \epsilon = X\beta + \epsilon,$$ where $Y$ is $n \times 1$, $X$ is $n \times p$, $\beta$ is $p \times 1$, and $\epsilon$ is $n \times 1$. I assume that the errors $\epsilon$ are independent with mean $0$ and covariance matrix $\sigma^2I$.

In OLS, the fitted values are $\hat{Y} = HY$, where $H = X(X^TX)^{-1}X^T$ is the $n \times n$ hat matrix. I want to find the MSE of $\hat{Y}$.
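(For concreteness, here is a minimal NumPy sketch of this setup; the dimensions, seed, and coefficients are arbitrary and only for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                        # arbitrary small sizes
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])   # arbitrary true coefficients
Y = X @ beta + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the n x n hat matrix
Y_hat = H @ Y                          # fitted values

# H is symmetric and idempotent, and H @ Y matches the least-squares fit
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
assert np.allclose(Y_hat, X @ np.linalg.lstsq(X, Y, rcond=None)[0])
```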

By the bias-variance decomposition, I know that

\begin{align*} MSE(\hat{Y}) &= bias^2(\hat{Y}) + var(\hat{Y})\\ &= (E[HY] - \mu)^T(E[HY] - \mu) + var(HY)\\ &= (H\mu - \mu)^T(H\mu - \mu) + \sigma^2H\\ &= 0 + \sigma^2H \end{align*}

I'm confused by the dimensions in the last step. The $bias^2$ term is a scalar. However, $var(\hat{Y})$ is an $n \times n$ matrix. How can one add a scalar to an $n \times n$ matrix where $n \neq 1$?

Adrian
  • Here is a good blog to explain differences, including in a machine learning context at https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/ . – AJKOER Jul 15 '20 at 00:40
  • On the additive bias term, a matrix can have all its elements equal to zero. – AJKOER Jul 15 '20 at 00:50
  • @AJKOER so are you implying that $(H\mu - \mu)^T(H\mu - \mu)$ is a matrix of all 0's? – Adrian Jul 15 '20 at 00:54
  • 2
  • Your dimensions don't work out, because on the second line of your MSE calculation the dimensions suddenly change. – whuber Jul 17 '20 at 15:51
  • @whuber, yes, it seems like starting from the second line the dimensions already do not match. How would you suggest I write the starting equation? or the second line to get the dimensions right from the get go – Adrian Jul 17 '20 at 16:19
  • 1
  • It depends on what you mean by "MSE." The usual meanings are scalar in nature, but you seem to want it to be an $n\times n$ matrix. – whuber Jul 17 '20 at 17:55
  • @whuber I am familiar with the MSE decomposition in the scalar case. I just wanted to write things out in matrix notation. But even then I think the final MSE should still be a scalar, not a matrix. Is there a way to write the MSE using matrix notation of $Y$, and $X$, to get a scalar at the end, I may need to take the trace of something to get a scalar?... – Adrian Jul 18 '20 at 15:29
  • Isn't the error vector $e = \hat Y - X\beta$? In that case the mean square error is $|e|^2/n$ or, if you prefer, $e^\prime e/n.$ – whuber Jul 18 '20 at 16:19
  • @whuber that makes sense. do you think it's possible to do a bias-variance decomposition of the MSE in this case? – Adrian Jul 18 '20 at 17:23
  • Sure, because the bias of the MSE is well-known and easy to compute. – whuber Jul 18 '20 at 17:34
  • https://stats.stackexchange.com/q/244301/119261 – StubbornAtom May 07 '21 at 21:36

1 Answer


More explanation in the edit below

I think the confusion arises because of the two different meanings of the MSE:

  1. A value calculated from a sample of fitted values or predictions; this is usually what we mean when we write $\operatorname{MSE}(\hat{Y})$ in the context of OLS, since $\hat{Y}$ is the vector of fitted values.

  2. A value calculated from an estimator. It is this meaning for which we have the variance–bias decomposition. We use this meaning of the MSE in the context of OLS too, but usually for the MSE of $\hat{\beta}$, where $\hat{\beta}$ is an estimator of the parameter $\beta$. If we take $\hat{\beta} = (X^TX)^{-1}X^TY$, then $\hat{\beta}$ is unbiased (this follows from the assumption $\operatorname{E}[\epsilon] = 0$, as shown further below), so $\operatorname{Bias}(\hat{\beta}) = 0$ and thus $\operatorname{MSE}(\hat{\beta}) = \operatorname{Var}(\hat{\beta})$ by the variance–bias decomposition. (A small numerical illustration of the two meanings follows this list.)
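To make the distinction concrete, here is a rough NumPy sketch (the sizes, seed, error variance, and replication count are all arbitrary assumptions): the first meaning is a single number computed from one observed sample, while the second averages over repeated draws from the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0                    # arbitrary assumptions
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
mu = X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Meaning 1: a single number computed from one observed sample,
# comparing the fitted values with the observed responses.
Y = mu + sigma * rng.normal(size=n)
sample_mse = np.mean((H @ Y - Y) ** 2)

# Meaning 2: a property of Y_hat as an estimator of mu = X beta,
# approximated by averaging over many independent draws of the errors.
reps = 5000
per_rep = [np.mean((H @ (mu + sigma * rng.normal(size=n)) - mu) ** 2) for _ in range(reps)]
estimator_mse = np.mean(per_rep)   # approx. trace(sigma^2 H) / n = sigma^2 p / n

print(sample_mse, estimator_mse)
```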

However, we can view $\hat{Y}$ as an estimator of $X\beta$ and thus consider $\operatorname{MSE}(\hat{Y})$ in the second sense. This is really just a rephrasing of the usual OLS estimation of $\beta$: In the normal setup we estimate the parameter $\beta$ given $X$ and $Y$, while in this new setup we estimate the parameter $X\beta$ given $X$ and $Y$. Alas, the notation is now confusing, since $\hat{Y}$ suggests that we are estimating $Y$ (a random variable), which we are not doing.

To simplify the formalism, we will use the notation of the OP and define $\mu = X\beta$. Note that this is a fixed $n \times 1$ vector (the mean vector $\operatorname{E}[Y]$), not a scalar mean.

We also have to clarify some definitions, since we are now dealing with a vector-valued estimator. First the variance (see this answer for some explanation):

\begin{equation*} \operatorname{Var}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] \end{equation*}

The definition of the bias doesn't change from the 1-dimensional case:

\begin{equation*} \operatorname{Bias}(\hat{Y}) = \operatorname{E}[\hat{Y}]-\mu \end{equation*}

However, we do have to find a vector-valued equivalent of the 1-dimensional expression $\operatorname{Bias}(\hat{Y})^2$, since this appears in the variance–bias decomposition. In the same vein as the vector-valued variance, this equivalent expression is the following:

\begin{equation*} \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \end{equation*}

Note that $\operatorname{Bias}(\hat{Y})$ is a fixed vector, so if the expression $\operatorname{E}[\hat{Y}]-\mu$ appears within the scope of an expected-value operator, we can take it out as a constant. This question is about this fact, albeit for the 1-dimensional case.

And finally the MSE itself:

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E} \left [\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T \right] \end{equation*}

So, with all this in hand, we can now prove the variance–bias decomposition of the MSE for a vector-valued estimator, which is really just a rephrasing of the usual proof for the 1-dimensional case:

\begin{align*} \operatorname{MSE}(\hat{Y}) &= \operatorname{E} \left [\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T \right] \\ &= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]+\operatorname{E}[\hat{Y}]-\mu\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]+\operatorname{E}[\hat{Y}]-\mu\right)^T\right]\\ &= \operatorname{E}\left[\left(\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)+\left(\operatorname{E}[\hat{Y}]-\mu\right)\right)\left(\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T+\left(\operatorname{E}[\hat{Y}]-\mu\right)^T\right)\right]\\ &= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T +\left(\hat{Y}-\operatorname{E}[\hat{Y}] \right ) \left (\operatorname{E}[\hat{Y}]-\mu \right)^T\right. \\ &\hphantom{xxxxxxxxxx} + \left.\left(\operatorname{E}[\hat{Y}]-\mu \right)\left(\hat{Y}-\operatorname{E}[\hat{Y}] \right)^T +\left( \operatorname{E}[\hat{Y}]-\mu \right)\left( \operatorname{E}[\hat{Y}]-\mu \right)^T\right] \\ &= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\right] + \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}] \right ) \left (\operatorname{E}[\hat{Y}]-\mu \right)^T\right] \\ &\hphantom{xxxxxxxxxx} + \operatorname{E}\left[\left (\operatorname{E}[\hat{Y}]-\mu \right)\left(\hat{Y}-\operatorname{E}[\hat{Y}] \right)^T\right] + \operatorname{E}\left[\left( \operatorname{E}[\hat{Y}]-\mu \right)\left( \operatorname{E}[\hat{Y}]-\mu \right)^T\right] \\ &=\operatorname{Var}(\hat{Y}) + \operatorname{E}\left[\hat{Y}-\operatorname{E}[\hat{Y}] \right]\left(\operatorname{E}[\hat{Y}]-\mu \right)^T \\ &\hphantom{xxxxxxxxxx} + \left (\operatorname{E}[\hat{Y}]-\mu \right)\operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}] \right)^T\right]+ \left( \operatorname{E}[\hat{Y}]-\mu \right)\left( \operatorname{E}[\hat{Y}]-\mu \right)^T \hphantom{xx} (*)\\ &=\operatorname{Var}(\hat{Y}) + \left(\operatorname{E}[\hat{Y}]-\operatorname{E}[\hat{Y}] \right)\left(\operatorname{E}[\hat{Y}]-\mu \right)^T \\ & \hphantom{xxxxxxxxxx}+ \left (\operatorname{E}[\hat{Y}]-\mu \right)\left(\operatorname{E}[\hat{Y}]-\operatorname{E}[\hat{Y}] \right)^T + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\ &=\operatorname{Var}(\hat{Y}) + 0\left(\operatorname{E}[\hat{Y}]-\mu \right)^T + \left (\operatorname{E}[\hat{Y}]-\mu \right)0^T + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\ &= \operatorname{Var}(\hat{Y}) + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \end{align*}
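As a sanity check on this matrix-valued decomposition, here is a small Monte Carlo sketch (NumPy; the design, the shrinkage factor $c$, and the replication count are arbitrary assumptions, and the shrinkage is only there to make the bias term nonzero):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 8, 2, 1.0                      # small, arbitrary sizes
X = rng.normal(size=(n, p))
mu = X @ np.array([1.0, -1.0])               # mu = X beta for an arbitrary beta
H = X @ np.linalg.inv(X.T @ X) @ X.T

c = 0.7   # deliberate shrinkage so that the estimator of mu is *biased*

reps = 20000
ests = np.array([c * (H @ (mu + sigma * rng.normal(size=n))) for _ in range(reps)])

mse  = (ests - mu).T @ (ests - mu) / reps            # E[(Yhat - mu)(Yhat - mu)^T]
mean = ests.mean(axis=0)
var  = (ests - mean).T @ (ests - mean) / reps        # Var(Yhat)
bias = mean - mu                                     # Bias(Yhat)

# The n x n matrices agree up to Monte Carlo error:
print(np.max(np.abs(mse - (var + np.outer(bias, bias)))))
```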

Let's now actually calculate the bias and the variance of the estimator $\hat{Y}$:

\begin{align*} \operatorname{Bias}(\hat{Y}) &= \operatorname{E}[\hat{Y}]-\mu \\ &= \operatorname{E}[\hat{Y}-\mu] \\ &= \operatorname{E}\left[X(X^TX)^{-1}X^TY-X\beta\right] \\ &= \operatorname{E}\left[X\left((X^TX)^{-1}X^TY-\beta\right)\right] \\ &= X\operatorname{E}\left[(X^TX)^{-1}X^TY-\beta\right] \\ &= X\operatorname{E}[\hat{\beta}-\beta] \\ &= X0 \\ &= 0 \end{align*}

The equality $\operatorname{E}[\hat{\beta}-\beta]=0$ says that the OLS estimator is unbiased; it follows directly from the model assumption $\operatorname{E}[\epsilon] = 0$ together with the identity $\hat{\beta} - \beta = (X^TX)^{-1}X^T\epsilon$ derived further below. Note that $\operatorname{Bias}(\hat{Y})=0$ implies that $\operatorname{E}[\hat{Y}]=\mu$ by simple rearrangement.

We now calculate the variance:

\begin{align*} \operatorname{Var}(\hat{Y}) &= \operatorname{E}\left[(\hat{Y}-\operatorname{E}[\hat{Y}])(\hat{Y}-\operatorname{E}[\hat{Y}])^T\right]\\ &= \operatorname{E}\left[(\hat{Y}-\mu)(\hat{Y}-\mu)^T\right]\\ &= \operatorname{E}\left[(X\hat{\beta}-X\beta)(X\hat{\beta}-X\beta)^T\right]\\ &= \operatorname{E}\left[X(\hat{\beta}-\beta)(\hat{\beta}-\beta)^TX^T\right]\\ &= X\operatorname{E}\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right]X^T\\ &= X\operatorname{E}\left[(\hat{\beta}-\operatorname{E}[\hat{\beta}])(\hat{\beta}-\operatorname{E}[\hat{\beta}])^T\right]X^T \hphantom{xx} (\text{since } \hat{\beta} \text{ is unbiased})\\ &= X\operatorname{Var}(\hat{\beta})X^T\\ &= X\left(\sigma^2(X^TX)^{-1}\right)X^T \hphantom{xx} (**)\\ &= \sigma^2X(X^TX)^{-1}X^T\\ &= \sigma^2H \end{align*}

We prove the step marked $(**)$, namely that $\operatorname{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$:

\begin{align*} \hat{\beta} - \beta &= (X^TX)^{-1}X^TY - \beta \\ &= (X^TX)^{-1}X^T(X\beta + \epsilon) - \beta \\ &= (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon - \beta \\ &= \beta + (X^TX)^{-1}X^T\epsilon - \beta \\ &= (X^TX)^{-1}X^T\epsilon \end{align*}

Thus:

\begin{align*} \operatorname{Var}(\hat{\beta}) &=\operatorname{E}\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^T\right] \\ &= \operatorname{E}\left[(X^TX)^{-1}X^T\epsilon\left((X^TX)^{-1}X^T\epsilon\right)^T\right] \\ &= \operatorname{E}\left[(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-T}\right] \\ &= (X^TX)^{-1}X^T\operatorname{E}\left[\epsilon\epsilon^T\right]X(X^TX)^{-T} \\ &= (X^TX)^{-1}X^T\operatorname{E}\left[(\epsilon-0)(\epsilon-0)^T\right]X(X^TX)^{-T} \\ &= (X^TX)^{-1}X^T\operatorname{E}\left[(\epsilon-\operatorname{E}[\epsilon])(\epsilon-\operatorname{E}[\epsilon])^T\right]X(X^TX)^{-T} \\ &= (X^TX)^{-1}X^T\operatorname{Var}(\epsilon)X(X^TX)^{-T} \\ &= (X^TX)^{-1}X^T(\sigma^2I)X(X^TX)^{-T} \hphantom{xx} (\text{since the errors are uncorrelated with common variance } \sigma^2)\\ &= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-T} \\ &= \sigma^2(X^TX)^{-T} \\ &= \sigma^2\left((X^TX)^T\right)^{-1} \\ &= \sigma^2(X^TX)^{-1} \end{align*}
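If you want to verify $\operatorname{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$ numerically, a quick simulation sketch (with an arbitrary design matrix, coefficients, and error variance) looks like this:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 40, 3, 2.0                      # arbitrary assumptions
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Keep X fixed, redraw the errors many times, and re-estimate beta each time.
reps = 20000
betas = np.array([
    XtX_inv @ X.T @ (X @ beta + sigma * rng.normal(size=n))
    for _ in range(reps)
])

empirical = np.cov(betas, rowvar=False)       # p x p sample covariance of beta_hat
theoretical = sigma**2 * XtX_inv              # sigma^2 (X^T X)^{-1}
print(np.max(np.abs(empirical - theoretical)))
```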

So, putting it all together:

\begin{align*} \operatorname{MSE}(\hat{Y}) &= \operatorname{Var}(\hat{Y}) + \operatorname{Bias}(\hat{Y})\operatorname{Bias}(\hat{Y})^T \\ &= \sigma^2H + 00^T \\ &= \sigma^2H \end{align*}

This is the answer that the OP calculated. :)
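For completeness, a direct Monte Carlo check of $\operatorname{MSE}(\hat{Y}) = \sigma^2H$ (again only a sketch, with arbitrary choices of $X$, $\beta$, and $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 10, 2, 1.5                      # arbitrary assumptions
X = rng.normal(size=(n, p))
mu = X @ np.array([2.0, -0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Approximate MSE(Y_hat) = E[(Y_hat - mu)(Y_hat - mu)^T] by simulation.
reps = 50000
devs = np.array([H @ (mu + sigma * rng.normal(size=n)) - mu for _ in range(reps)])
mse_hat = devs.T @ devs / reps

print(np.max(np.abs(mse_hat - sigma**2 * H)))   # small, up to Monte Carlo error
```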


EDIT

The OP asked in the comments why we define

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E} \left [\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T \right] \end{equation*}

and not

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E} \left[\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right) \right]. \end{equation*}

This is a good question; indeed, it's the crux of the OP's original question and I didn't address it properly. I will attempt to redress this oversight.

In the 1-dimensional case, the meaning of the definition

\begin{equation*} \operatorname{MSE}(\hat{Y}) = \operatorname{E}\left[\left(\hat{Y}-\mu\right)^2\right] \end{equation*}

is unambiguous. But if $\hat{Y}-\mu$ is a vector, then we have to decide how to interpret the expression $\left(\hat{Y}-\mu\right)^2$. We have two options:

  1. $\left(\hat{Y}-\mu\right)^2 = \left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)$

  2. $\left(\hat{Y}-\mu\right)^2 = \left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T$

In my original answer I went with the second option (based on arguments given here). But what about the first option? Well, we still have the variance–bias decomposition! Let's show that. We start by defining all the relevant terms; I mark them with a superscript asterisk * in order to distinguish them from the definitions given in my original answer, but please note that this is not standard notation:

\begin{align*} \operatorname{MSE}^*(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right) \right] \\ \operatorname{Var}^*(\hat{Y}) &= \operatorname{E}\left[\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)^T\left(\hat{Y}-\operatorname{E}[\hat{Y}]\right)\right] \\ \operatorname{Bias}^*(\hat{Y}) &= \operatorname{E}[\hat{Y}]-\mu \left(= \operatorname{Bias}(\hat{Y}) \right)\\ \operatorname{Bias}^*(\hat{Y})^2 &= \operatorname{Bias}^*(\hat{Y})^T\operatorname{Bias}^*(\hat{Y}) \end{align*}

(Note that we could multiply by the constant factor $\frac{1}{n}$, i.e. define

\begin{equation*} \operatorname{MSE}^*(\hat{Y}) = \operatorname{E}\left[\tfrac{1}{n}\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right) \right]. \end{equation*}

Including this constant factor makes no difference to the argument: it simply scales the MSE, the variance, and the squared bias by $\tfrac{1}{n}$, so the decomposition below holds either way.)

With these definitions, the MSE still decomposes into the sum of the variance and the square of the bias:

\begin{equation*} \operatorname{MSE}^*(\hat{Y}) = \operatorname{Var}^*(\hat{Y}) + \operatorname{Bias}^*(\hat{Y})^2 \end{equation*}

The proof is all but identical to the one given above: One just has to move a few superscript $T$s around.
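Here is a small simulation sketch of the starred (scalar) decomposition; it also illustrates that $\operatorname{MSE}^*(\hat{Y})$ is the trace of the matrix-valued $\operatorname{MSE}(\hat{Y})$ (the data, shrinkage factor, and replication count are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 8, 2, 1.0                       # arbitrary assumptions
X = rng.normal(size=(n, p))
mu = X @ np.array([1.0, 1.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T

c = 0.7   # shrinkage again, so the bias is nonzero and both terms matter

reps = 20000
ests = np.array([c * (H @ (mu + sigma * rng.normal(size=n))) for _ in range(reps)])
mean = ests.mean(axis=0)

mse_star   = np.mean(np.sum((ests - mu) ** 2, axis=1))     # E[(Yhat-mu)^T (Yhat-mu)]
var_star   = np.mean(np.sum((ests - mean) ** 2, axis=1))   # E[(Yhat-E[Yhat])^T (Yhat-E[Yhat])]
bias_star2 = np.sum((mean - mu) ** 2)                      # Bias^T Bias

mse_matrix = (ests - mu).T @ (ests - mu) / reps            # matrix-valued MSE from the original answer

print(mse_star, var_star + bias_star2)   # the starred decomposition holds
print(mse_star, np.trace(mse_matrix))    # MSE* equals the trace of the matrix MSE
```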

What the OP did in their original calculation was to mix up the two sets of definitions when applying the variance–bias decomposition: the bias term $(H\mu - \mu)^T(H\mu - \mu)$ is the scalar $\operatorname{Bias}^*(\hat{Y})^2$, while the variance term $\operatorname{Var}(HY) = \sigma^2H$ is the matrix-valued $\operatorname{Var}(\hat{Y})$. This is why the dimensions didn't match.

dwolfeu
  • Why is the MSE defined as $\operatorname{MSE}_\mu(\hat{Y}) = \operatorname{E}_\mu \left [\left(\hat{Y}-\mu\right)\left(\hat{Y}-\mu\right)^T \right]$ and not $\operatorname{MSE}_\mu(\hat{Y}) = \operatorname{E}_\mu \left [\left(\hat{Y}-\mu\right)^T\left(\hat{Y}-\mu\right)\right]$ instead? – Adrian Jul 25 '20 at 20:21
  • @Adrian A good question. I have updated my answer. – dwolfeu Jul 26 '20 at 13:16
  • Thanks! One last question: Is $\operatorname{Var}^*_\mu(\hat{Y}) = \operatorname{E}_\mu\left[\left(\hat{Y}-\operatorname{E}_\mu[\hat{Y}]\right)^T\left(\hat{Y}-\operatorname{E}_\mu[\hat{Y}]\right)\right]$ equivalent to $trace\left(\operatorname{E}_\mu\left[\left(\hat{Y}-\operatorname{E}_\mu[\hat{Y}]\right)\left(\hat{Y}-\operatorname{E}_\mu[\hat{Y}]\right)^T\right]\right)$? Typically when I think of variance (or covariance) of a vector like $\hat{Y}$ I think it's a matrix, thus the outerproduct. – Adrian Jul 26 '20 at 19:09
  • Yes. This is a consequence of the additivity of the expected-value operator and the [general fact](https://en.wikipedia.org/wiki/Trace_(linear_algebra)#Trace_of_a_product) that $\operatorname{tr}(vv^T) = v^Tv$ for any vector $v$. – dwolfeu Jul 27 '20 at 05:09