
Before I describe my question, it is necessary to note a common fact in estimation theory: the MSE can be decomposed into variance plus squared bias. Depending on whether it is the MSE of an estimator or the MSE of a new observation to be predicted, the formula is slightly different (I personally find the accepted answer in Relation between MSE and Bias-Variance clear on this).

However, what I am asking about is NOT either of the two common MSE decompositions above. Instead, it is the decomposition of the MSE on a test data set (after applying a prediction model trained on a training set), as follows:

(*) $E(e^2)=(E(e))^2+Var(e)$

where $e$ denotes the error of a test data point and the expectation is taken over different test data points.
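To make (*) concrete, here is a minimal sketch with a synthetic error vector (purely illustrative), showing that the identity holds exactly for any sample of errors as long as $Var(e)$ is computed as the population variance (divide by $n$, NumPy's default `ddof=0`):

```python
# Check of (*): E(e^2) = (E(e))^2 + Var(e) for a sample of test errors.
# The errors here are synthetic; e would normally be y_hat - y on a test set.
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(loc=0.3, scale=1.0, size=10_000)  # hypothetical test-set errors

mse      = np.mean(e ** 2)
bias_sq  = np.mean(e) ** 2
variance = np.var(e)            # population variance (ddof=0)

print(mse, bias_sq + variance)  # the two numbers agree up to floating-point error
```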

My questions are:

(1) Does the decomposition of MSE in Equation (*) conceptually suggest any bias-variance trade-off?

  • My own conceptual interpretation is that the bias part $(E(e))^2$ tells to what extent the prediction model systematically overestimates or underestimates the unknown true outcome ($\hat{y}$ being systematically larger or smaller than $y$), while the variance part $Var(e)$ tells how “jumpy” the prediction error is across different test data points.
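As a small illustration of that interpretation (synthetic data, purely hypothetical predictors), compare two predictors on the same test set: one that systematically overestimates and one that is unbiased but erratic:

```python
# Bias-dominated vs. variance-dominated test errors, on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=5_000)                              # true test outcomes

y_hat_biased = y + 0.5                                  # systematic overestimation
y_hat_noisy  = y + rng.normal(scale=0.5, size=y.size)   # unbiased but "jumpy"

for name, y_hat in [("biased", y_hat_biased), ("noisy", y_hat_noisy)]:
    e = y_hat - y
    print(f"{name:>6}: bias^2 = {np.mean(e)**2:.3f}, Var(e) = {np.var(e):.3f}, "
          f"MSE = {np.mean(e**2):.3f}")
```

The biased predictor's MSE is almost entirely $(E(e))^2$, while the noisy predictor's MSE is almost entirely $Var(e)$, even though the two MSEs are about the same.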

(2) Is this decomposition related to the MSE decomposition of a new observation to be predicted?

  • I can only see the differences. The “expectation” of the squared error in Equation (*) is over different test data points, whereas in the MSE decomposition of a new observation, the expectation of the squared error is over different training samples. That is, they use different definitions of MSE. Consequently, to obtain the decomposition in (*), I can simply calculate the squared mean error and the variance of the errors over many test data points, while for the latter, bootstrapping is typically needed to generate many training samples (as bias_variance_decomp.py in the Python package mlxtend does).
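For contrast, here is a sketch of the other decomposition, in which the expectation is taken over training samples: the training set is resampled by bootstrap, the model is refit each round, and the bias² and variance are computed per test point from the spread of predictions across rounds. This mirrors the strategy of mlxtend's bias_variance_decomp, but the code below is a hand-rolled approximation with synthetic data and a plain linear model (all names and sizes are illustrative only):

```python
# Bias-variance decomposition over bootstrapped training sets (for contrast with (*)).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_train, n_test, n_rounds = 200, 500, 300

X_train = rng.uniform(-3, 3, size=(n_train, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=n_train)
X_test  = rng.uniform(-3, 3, size=(n_test, 1))
y_test  = np.sin(X_test[:, 0])                   # noiseless targets, for clarity

preds = np.empty((n_rounds, n_test))
for r in range(n_rounds):
    idx = rng.integers(0, n_train, size=n_train)            # bootstrap resample
    model = LinearRegression().fit(X_train[idx], y_train[idx])
    preds[r] = model.predict(X_test)

avg_pred = preds.mean(axis=0)                    # expectation over training sets, per test point
bias_sq  = np.mean((avg_pred - y_test) ** 2)     # averaged over test points
variance = np.mean(preds.var(axis=0))
exp_loss = np.mean((preds - y_test) ** 2)

print(f"expected loss = {exp_loss:.3f}, bias^2 + variance = {bias_sq + variance:.3f}")
```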

Perhaps the decomposition of MSE of a test data set looks a bit odd, but I really want to figure out if it makes sense. Any insight would be much appreciated.

zhenli
  • Your (1) is not correct. Try it either empirically or analytically. Suppose you have a random variable with an $N(\mu, \sigma^2)$ distribution, and you want to estimate $\sigma^2$ from $n$ observations using one of $\frac1{n-1}\sum(x_i-\bar x)^2$, $\frac1{n}\sum(x_i-\bar x)^2$, or $\frac1{n+1}\sum(x_i-\bar x)^2$. You will find that these have different biases, different variances, and different expected squared errors; the first has no bias, while the last has the lowest variance simply because it is smaller. Guess and then check which has the least expected squared error. – Henry Aug 12 '21 at 22:06
  • Hi @Henry, thank you for your examples. I guess the last estimator, with the sum divided by $n+1$, has the least expected squared error; it actually has the smallest variance and the largest bias among the three estimators. But I agree with you that the trade-off mentioned in question (1) does not necessarily exist, since a larger bias is not always paired with a smaller variance. Perhaps we only observe such a trade-off in certain contexts (say, for machine learning models, where the MSE of a prediction model decomposes in this way and models with smaller bias are often more complex and thus have larger variance). – zhenli Aug 13 '21 at 03:08
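Following Henry's suggestion, here is a quick simulation sketch (the choices $\mu = 0$, $\sigma^2 = 1$, $n = 10$ are arbitrary) that estimates the bias, variance, and MSE of the three estimators:

```python
# Simulation of Henry's example: three variance estimators differing only in the denominator.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, n_sims = 0.0, 1.0, 10, 200_000

x  = rng.normal(mu, np.sqrt(sigma2), size=(n_sims, n))
ss = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)   # sum of squared deviations

for denom in (n - 1, n, n + 1):
    est  = ss / denom
    bias = est.mean() - sigma2
    print(f"1/{denom}: bias = {bias:+.4f}, var = {est.var():.4f}, "
          f"MSE = {np.mean((est - sigma2) ** 2):.4f}")
```

For normal data the $1/(n+1)$ estimator indeed ends up with the smallest MSE, confirming the guess in the comment above.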

0 Answers