Before I describe my question, it is worth noting a standard fact in estimation theory: MSE can be decomposed into variance plus squared bias. Depending on whether it is the MSE of an estimator or the MSE of a new observation to be predicted, the formula differs slightly (I personally find the accepted answer in Relation between MSE and Bias-Variance clear on this).
However, what I am asking about is NOT either of the above two common MSE decompositions. Instead, it is the decomposition of the MSE on a test data set (after applying a prediction model trained on a training set):
(*) $E(e^2)=(E(e))^2+\operatorname{Var}(e)$
where $e$ denotes the prediction error at a test data point, and the expectation is taken over test data points.
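For completeness, (*) is nothing more than the definitional identity of variance rearranged:

$$\operatorname{Var}(e) = E\big[(e - E(e))^2\big] = E(e^2) - (E(e))^2 \quad\Longrightarrow\quad E(e^2) = (E(e))^2 + \operatorname{Var}(e).$$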
My questions are:
(1) Does the decomposition of MSE in Equation (*) conceptually suggest any bias-variance trade-off?
- My own conceptual interpretation is that the bias part $(E(e))^2$ measures the extent to which the prediction model systematically over- or underestimates the unknown true outcome ($\hat{y}$ being systematically larger or smaller than $y$), while the variance part $\operatorname{Var}(e)$ measures how much the prediction error fluctuates across test data points. For example, a model that always predicts 2 units too low has $(E(e))^2 = 4$ even though $\operatorname{Var}(e) = 0$.
(2) Is this decomposition related to the MSE decomposition of a new observation to be predicted?
- I can only see differences. The expectation of the squared error in Equation (*) is taken over different test data points, whereas in the MSE decomposition of a new observation, the expectation is taken over different training samples; that is, the two quantities define MSE differently. Consequently, to compute the decomposition in (*), I can simply take the squared mean error and the variance of the error over many test data points (a sketch is given below), while for the latter, bootstrapping is typically needed to generate many training samples (as bias_variance_decomp.py in the Python package mlxtend does).
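To make the first computation concrete, here is a minimal Python sketch of the decomposition in (*). The synthetic data and the choice of `LinearRegression` are illustrative assumptions, not part of the question; any fitted model would do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic data (assumed for this sketch): y = 3x + noise.
X = rng.uniform(0, 10, size=(500, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Errors on the test set; the "expectation" in (*) is the average over these points.
e = y_test - model.predict(X_test)

mse = np.mean(e ** 2)       # E(e^2)
bias_sq = np.mean(e) ** 2   # (E(e))^2
var = np.var(e)             # Var(e)

print(f"MSE          = {mse:.6f}")
print(f"bias^2 + var = {bias_sq + var:.6f}")
```

Because `np.var` defaults to the population variance (`ddof=0`), the two printed numbers agree exactly up to floating-point rounding, which is precisely the identity in (*).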
Perhaps the decomposition of the MSE of a test data set looks a bit odd, but I really want to figure out whether it makes sense. Any insight would be much appreciated.