What is the trade-off in the bias-variance trade-off?

Question

Let $\theta$ be a parameter and $\hat{\theta}$ be an estimator for $\theta$.

I understand that the MSE of $\hat{\theta}$ can be decomposed into its squared bias and its variance. That makes sense. What I don't see is where the trade-off comes in. A trade-off would suggest that all estimators have the same MSE, so that you can only get less variance by incurring more bias, and vice versa.
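For concreteness, the decomposition I am referring to is the standard one, written here with $E$ denoting expectation over the sampling distribution of $\hat{\theta}$ (the notation is my own choice):

$$
\begin{aligned}
\operatorname{MSE}(\hat{\theta})
&= E\big[(\hat{\theta} - \theta)^2\big] \\
&= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
 + 2\,\big(E[\hat{\theta}] - \theta\big)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
 + \big(E[\hat{\theta}] - \theta\big)^2 \\
&= \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2 ,
\end{aligned}
$$

where $\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$ and the cross term vanishes because $E\big[\hat{\theta} - E[\hat{\theta}]\big] = 0$. Nothing in this identity by itself forces the two terms to move in opposite directions.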

Can you please explain the mathematics behind it?

It means that *in general*, when one reduces the variance involved in estimation, the bias increases. It isn't true in all cases: you can construct estimators that dominate others in terms of both bias and variance (take, for instance, the silly estimator $\hat{\theta} + \epsilon$, where $\epsilon \sim \text{normal}(\mu, \sigma^2)$ with $\mu \neq 0$ and $\hat{\theta}$ is an unbiased estimator). — dsaxton, Mar 04 '16 at 19:54
@whuber, this isn't a duplicate of that other question. I was asking for mathematical justification for the bias-variance trade-off, and it seems that dsaxton is suggesting it doesn't exist, it's more a rule of thumb — , Mar 04 '16 at 20:26
I saw arguments in the duplicate that look like "mathematical justification." Could you explain what exactly you mean by this term, then? — whuber, Mar 04 '16 at 20:26
@whuber: essentially, I was expecting a proof/explanation why all (sensible) estimators of a parameter have similar MSE. It's implicit in this answer, for instance: http://stats.stackexchange.com/a/20303/80379 Should I ask a new question? — , Mar 04 '16 at 21:08
I don't understand what you are looking for, because "sensible" estimator and "similar" MSEs are subjective impressions, not mathematical concepts. It looks to me like Matthew Drury's graphical explanations in the duplicate thread already are excellent mathematical explanations of the bias-variance trade-off. — whuber, Mar 04 '16 at 21:22
The pictures don't constitute a mathematical explanation. The reason I'm resorting to the words "sensible" and "similar" is that I don't know the precise mathematical truth. So, for a concrete example: why is the MSE of the ridge/lasso estimator the same as that of OLS? The answer I linked above suggests that it's an example of the bias-variance trade-off. — , Mar 04 '16 at 22:08
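
The point under discussion in these comments can be checked numerically on a toy problem. The following is a minimal sketch, not taken from the thread: it assumes i.i.d. normal data with mean $\theta$ and known standard deviation, and considers the family of estimators $c\,\bar{X}$ for a shrinkage factor $c \in [0, 1]$. The function names `shrinkage_mse` and `simulate_mse` and all parameter values are hypothetical, chosen only for illustration. In this family the squared bias is $(c-1)^2\theta^2$ and the variance is $c^2\sigma^2/n$, so lowering $c$ reduces variance and raises bias, while the MSE itself is not constant across $c$.

```python
import numpy as np


def shrinkage_mse(c, theta, sigma, n):
    """Closed-form squared bias, variance and MSE of the estimator c * sample_mean."""
    bias_sq = ((c - 1.0) * theta) ** 2   # (E[c * xbar] - theta)^2
    variance = c ** 2 * sigma ** 2 / n   # Var(c * xbar)
    return bias_sq, variance, bias_sq + variance


def simulate_mse(c, theta, sigma, n, reps=200_000, seed=0):
    """Monte Carlo estimate of the MSE of c * sample_mean."""
    rng = np.random.default_rng(seed)
    # The mean of n i.i.d. N(theta, sigma^2) draws is N(theta, sigma^2 / n),
    # so the sample mean can be drawn directly.
    xbar = rng.normal(theta, sigma / np.sqrt(n), size=reps)
    return float(np.mean((c * xbar - theta) ** 2))


# Illustrative (assumed) values, not from the thread.
theta, sigma, n = 1.0, 2.0, 10

# Minimiser of c^2 * sigma^2 / n + (1 - c)^2 * theta^2 over c.
c_best = theta ** 2 / (theta ** 2 + sigma ** 2 / n)

for c in (0.0, 0.5, c_best, 1.0):
    bias_sq, variance, mse = shrinkage_mse(c, theta, sigma, n)
    mc = simulate_mse(c, theta, sigma, n)
    print(f"c={c:.3f}  bias^2={bias_sq:.4f}  var={variance:.4f}  "
          f"mse={mse:.4f}  simulated_mse={mc:.4f}")
```

Under these assumptions the unbiased choice $c = 1$ does not minimise the MSE, and different values of $c$ give different MSEs. The "trade-off" here describes how the two terms move along this one-parameter family, not an equality of MSE across estimators. Whether the same picture carries over to ridge or lasso versus OLS depends on the penalty strength and the true coefficients, which this sketch does not model.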

