
I remember reading many times, on CV and in other papers, the claim that "All the loss functions are essentially the same." What does that mean, and how can I understand it intuitively?

Haitao Du
  • Can you link to such claims on CV? – Matthew Drury Jul 17 '17 at 14:04
  • "All *the* loss functions" [my italics] suggests there's a class, though not necessarily a well-defined one, of loss functions under consideration. Could you elaborate a little? (And please reference the linked article more fully.) – Scortchi - Reinstate Monica Jul 17 '17 at 14:12
  • @MatthewDrury Sorry, I spent some time but cannot find it. I vaguely remember it coming from Mark L. Stone's answer or a comment? – Haitao Du Jul 17 '17 at 14:16
  • Are you referring to https://stats.stackexchange.com/questions/230282/question-about-conventions-for-l1-and-l2-regularization/230289#230289 ? If so, I was not attempting to make any claim along the lines of "All the loss functions are essentially the same". – Mark L. Stone Jul 17 '17 at 14:30
  • If indeed "all ... loss functions are essentially the same" then we wouldn't pay much attention to them, would we? Although the paper you reference asks this question (in its title), its conclusions survey several reasons why the answer is "no." – whuber Jul 17 '17 at 14:34

1 Answer


One sensible interpretation of this is that all loss functions derived from finite-dimensional norms are essentially the same. That is, if for each pair $(x_i,y_i)$ your loss function has the form $L_i(x_i,y_i)=\|f(x_i)-y_i\|$, where $\|\cdot\|$ is a norm, then all such loss functions are equivalent: there exist constants $c,C>0$, depending only on the dimension, the pair of norms involved, and the number of data points, such that for any two such loss functions $L,L'$ one has:

$$cL'\leq L\leq CL'.$$

All such loss functions are within some bounded deformation of each other and are topologically indistinguishable. In particular, this implies that if your classifier can theoretically become perfect ($L=0$) under loss function $L$, then it will also become perfect under $L'$, though perhaps at a different rate.

For example,

$$\|x\|_2\leq \|x\|_1\leq \sqrt{n}\|x\|_2.$$
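
Applying this term by term to summed losses (my reading of the setup above), say $L=\sum_i \|f(x_i)-y_i\|_1$ and $L'=\sum_i \|f(x_i)-y_i\|_2$, gives the equivalence with $c=1$ and $C=\sqrt{n}$:

$$\sum_i \|f(x_i)-y_i\|_2 \;\leq\; \sum_i \|f(x_i)-y_i\|_1 \;\leq\; \sqrt{n}\,\sum_i \|f(x_i)-y_i\|_2,$$

that is, $L'\leq L\leq \sqrt{n}\,L'$.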

However, the constants matter: the data we work with live in spaces of nontrivial dimension, and if, say, $n=100$, then the two norms can differ by a factor of up to 10, which from an operational point of view can be enormous.
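
To make that factor concrete, here is a small numerical sketch (my addition, not part of the original answer) comparing the $\ell_1/\ell_2$ ratio for a few illustrative error vectors in $n=100$ dimensions; the example vectors are made up for illustration:

```python
import numpy as np

n = 100
rng = np.random.default_rng(0)

# Three illustrative "residual" vectors in R^100 (chosen for illustration).
examples = {
    "gaussian": rng.standard_normal(n),  # typical random residuals
    "one-hot": np.eye(n)[0],             # all error in a single coordinate -> ratio 1
    "constant": np.ones(n),              # error spread evenly -> ratio sqrt(n) = 10
}

for name, x in examples.items():
    l1 = np.linalg.norm(x, 1)
    l2 = np.linalg.norm(x, 2)
    print(f"{name:>8}: ||x||_1 / ||x||_2 = {l1 / l2:5.2f}  (upper bound sqrt(n) = {np.sqrt(n):.1f})")
```

The bound is tight at both ends, and where a given error vector falls between them depends entirely on how the error is spread across coordinates.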

Alex R.
  • I do not see why this (familiar) definition of *metric equivalence* is relevant to *minimization* of loss functions. Indeed, these norms lead to sharply different solutions in many cases. The simplest concerns finding a number $\mu$ that minimizes $|\mathbf{x}-\mu\mathbf{1}|_p$ where $\mathbf{x}=(x_1,\ldots, x_n)$ and $\mathbf{1}=(1,\ldots, 1)$. For $p=1$ the solution $\hat \mu$ is the median of the $x_i$ whereas for $p=2$ the solution is the mean of $x_i$. It's easy to construct datasets where those values are dramatically far apart (relative to $|\mathbf{x}|_p$). – whuber Jul 18 '17 at 12:56
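
To illustrate this comment numerically, here is a short sketch (my addition, not whuber's), using a made-up dataset with a single large outlier and a grid search over $\mu$:

```python
import numpy as np

# Made-up dataset with one large outlier (for illustration only).
x = np.array([0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 100.0])

# Grid search over candidate values of mu.
mus = np.linspace(x.min(), x.max(), 100001)
l1_loss = np.abs(x[:, None] - mus).sum(axis=0)   # sum_i |x_i - mu|
l2_loss = ((x[:, None] - mus) ** 2).sum(axis=0)  # sum_i (x_i - mu)^2

print(f"L1 minimizer ~ {mus[l1_loss.argmin()]:.2f}  (median = {np.median(x):.2f})")
print(f"L2 minimizer ~ {mus[l2_loss.argmin()]:.2f}  (mean   = {np.mean(x):.2f})")
```

Both losses come from norms and are equivalent in the sense of the answer, yet their minimizers land in very different places once the data are skewed.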