
I'm trying to wrap my head around the connection between statistical regression and its probability-theoretic justification. In many books on statistics/machine learning, one is introduced to the idea of a loss function, typically followed by a phrase of the flavour 'a popular choice for this function is mean squared loss'. As far as I understand, the justification for this choice stems from the theorem that

$$ \arg\min_{Z \in L^2(\mathcal{G})} \ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ X \Vert \mathcal{G} \right] \tag{1} $$

where $X$ is the random variable to be estimated based on the information contained in $\mathcal{G}$. As far as I understand, probability theory teaches us that the conditional expectation $\mathbb{E}[X \Vert \mathcal{G}]$ is the best such estimate. If that's the case, why should our loss function still be a matter of choice? Clearly we should be statistically estimating $\mathbb{E}[X \Vert \mathcal{G}]$, which by (1) means minimizing the MSE.
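
To make (1) concrete for myself, I picture the simplest case where $\mathcal{G}$ is generated by a finite partition, so that any $\mathcal{G}$-measurable $Z$ is just one constant per cell. A minimal simulation sketch of that picture (toy numbers of my own choosing): the squared-loss-minimising constant in each cell turns out to be the cell mean, i.e. $\mathbb{E}[X \Vert \mathcal{G}]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy information set: G is generated by the finite partition {g == 0}, {g == 1}, {g == 2},
# so any G-measurable predictor Z is a single constant on each cell.
g = rng.integers(0, 3, size=9_000)
x = np.array([1.0, 4.0, -2.0])[g] + rng.normal(size=g.size)

# Grid-search the best constant per cell under squared loss and compare with the cell mean.
candidates = np.linspace(-5.0, 7.0, 601)
for level in range(3):
    xg = x[g == level]
    mse = ((xg[:, None] - candidates[None, :]) ** 2).mean(axis=0)
    print(level, candidates[mse.argmin()], xg.mean())  # minimiser ≈ conditional mean
```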

An answer which I have often read is that we simply define the conditional expectation to satisfy (1), but that doesn't seem right, as conditional expectations exist for any random variable in $L^1$. More importantly, there is an intuitive theoretical explanation for why this definition gives us an estimator capturing all the information available after observing $\mathcal{G}$: we use $\mathcal{G}$ to partition the total probability into possible paths and average over the remaining randomness in each of these. This interpretation in terms of information and $\sigma$-algebras has, as far as I can tell, nothing to do with minimizing MSE; we could have come up with it without ever knowing (1).
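
The same kind of toy setup illustrates what I mean by 'partitioning and averaging over the remaining randomness', and lets me check numerically the partial averaging property $\mathbb{E}\left[\mathbb{E}\left[X \Vert \mathcal{G}\right]\mathbf{1}_{G}\right] = \mathbb{E}\left[X \, \mathbf{1}_{G}\right]$ for the cells $G$ generating $\mathcal{G}$ (again just a sketch with made-up distributions):

```python
import numpy as np

rng = np.random.default_rng(1)

# G is generated by the finite partition {g == 0}, {g == 1}, {g == 2}.
g = rng.integers(0, 3, size=50_000)
x = np.where(g == 0, rng.exponential(2.0, g.size), rng.normal(g, 1.0))

# "Average over the remaining randomness in each cell": E[X || G] is the cell mean,
# broadcast back to every sample in that cell.
cond_exp = np.empty_like(x)
for level in range(3):
    cond_exp[g == level] = x[g == level].mean()

# Partial averaging: E[ E[X||G] 1_G ] equals E[ X 1_G ] for every cell G (up to float error).
for level in range(3):
    ind = g == level
    print((cond_exp * ind).mean(), (x * ind).mean())
```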

So my question really is: does minimizing MSE represent the theoretically optimal criterion, and if so, are we saying that any alternative (such as LAD) inherently trades away theoretical optimality in favour of other desirable estimation properties? Are we then necessarily leaving (as the explanation in the previous paragraph suggests) information contained in $\mathcal{G}$ on the table? And how do we quantify 'how much information' of $\mathcal{G}$ an estimator based on a different criterion (say, the median in the case of LAD) actually utilises?
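
A sketch of the contrast I have in mind (again a toy simulation, everything in it is made up): with skewed conditional distributions, the absolute-loss minimiser in each cell is the conditional median and the squared-loss minimiser is the conditional mean. Both are $\mathcal{G}$-measurable functions of the same partition, so what exactly LAD 'leaves on the table' is what I am trying to pin down.

```python
import numpy as np

rng = np.random.default_rng(2)

# Skewed conditional distributions, so the conditional mean and median differ in each cell.
g = rng.integers(0, 3, size=6_000)
x = rng.lognormal(mean=g.astype(float), sigma=0.8)

candidates = np.linspace(0.1, 30.0, 1200)
for level in range(3):
    xg = x[g == level]
    mae = np.abs(xg[:, None] - candidates[None, :]).mean(axis=0)
    mse = ((xg[:, None] - candidates[None, :]) ** 2).mean(axis=0)
    print(level,
          candidates[mae.argmin()], np.median(xg),  # LAD recovers the conditional median
          candidates[mse.argmin()], xg.mean())      # MSE recovers the conditional mean
```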

I've asked this question already on Mathematics Stack Exchange but I'm still not completely satisfied, so I was hoping someone here could shed some light on it. Judging by the number of similar questions on this subject, this is probably a phase all students of statistics pass through.

    conditional expectation [‖] is the best such estimate *in the least squares sense*. Mean absolute deviation might be appropriate in presence of outliers, in general maximising likelihood will give you lots of different answers depending on the noise distribution you assume. Start with a textbook. Do you have one? if so point to your misunderstandings. If you are learning ML then elements of statistical learning is a suitable mathematically advanced one. – seanv507 Jan 02 '22 at 12:49
  • I am indeed using ESL together with probability theory course notes (my education has been strongly skewed to pure math). My misunderstanding is: if you read arguments about measurability etc. then $\mathbb{E} \left[ X \Vert \mathcal{G} \right]$ is *not* just the least squares sense best estimate, but the estimate which maximally utilises the information in $\mathcal{G}$ by averaging over the remaining randomness. This is expressed by the partial averaging property. – DominatedConvergence Jan 02 '22 at 13:32
  • (By the partial averaging property I mean $\mathbb{E} \left[ \mathbb{E} \left[ X \Vert \mathcal{G} \right] \mathbf{1}_{G}\right] = \mathbb{E} \left[ X \, \mathbf{1}_{G} \right]$ for $G \in \mathcal{G}$.) – DominatedConvergence Jan 02 '22 at 13:33
  • "If you read arguments about measurability etc. then $\mathbb{E}[X \Vert \mathcal{G}]$ is not just the least squares sense best estimate, but the estimate which maximally utilises the information in $\mathcal{G}$ by averaging over the remaining randomness". What does 'utilises' mean? The answer depends on your goal. Some people have their own custom loss function, like: if I predict answers in the range of (60-80) wrong, that is terribly costly. I do not understand your question because the whole concept of 'utilises' or optimality is subjective. – user3494047 Jan 02 '22 at 13:41
  • Have a look at maximum likelihood estimation. – BigBendRegion Jan 02 '22 at 13:42
  • @user3494047 My intuition for 'utilizing all information' up to this point was the partial averaging property, but I'm guessing from the answers this is not the way to go. Would this mean that answers like [this](https://stats.stackexchange.com/questions/230545/intuition-for-conditional-expectation-of-sigma-algebra) one ("given only the information from $\mathscr{G}$, and not the whole of information from $\mathscr{F}$, $\mathbb{E}[\xi|\mathscr{G}]$ is in a rigorous sense our best possible guess for what the random variable $\xi$ is") already assume that 'best' is measured in terms of MSE? – DominatedConvergence Jan 02 '22 at 13:50
  • @othi yes, I think this interpretation is correct. – user3494047 Jan 02 '22 at 13:54
  • If you use a word like *all information* then that too requires a definition: Fisher's variance of the score has an underlying squared term. – Henry Jan 02 '22 at 14:13
  • Perhaps relevant: https://stats.stackexchange.com/questions/470626/why-is-using-squared-error-the-standard-when-absolute-error-is-more-relevant-to/470786#470786 – Richard Hardy Jan 02 '22 at 14:45
  • (1) Sometimes we care less about an unbiased estimation of some underlying $X$, and more about smallest MSE *prediction* of some observable. In this case, the bias-variance trade-off implies that we can accept some bias if we reduce variance more, and now regularization is interesting. (2) Sometimes we want to understand other functionals than the expectation, e.g., quantiles. Then we will use a quantile loss. – Stephan Kolassa Jan 03 '22 at 07:51

0 Answers