
I know that if the cost function is the squared error ($L^2$) or the absolute deviation ($L^1$), the solution to the regression problem is the conditional mean or the conditional median, respectively. To see this, a simple approach is to set the derivative of the cost function to zero, as follows.

\begin{align*} \frac{\mathrm{d}}{\mathrm{d}\beta} ||y-\beta||^2&=0 \implies \beta = \frac{1}{n}\sum_i y_i,\\ \frac{\mathrm{d}}{\mathrm{d}\beta} \sum_i |y_i-\beta| &= 0 \implies \sum_i \text{sgn}(y_i-\beta) =0 \implies \beta = \text{median}(y_i). \end{align*}
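
As a quick numerical sanity check (my own sketch, assuming NumPy; not part of the derivation), the minimizers of the empirical $L^2$ and $L^1$ losses over a grid do coincide with the sample mean and sample median:

```python
import numpy as np

# Skewed sample so that the mean and median differ visibly.
rng = np.random.default_rng(0)
y = rng.exponential(size=1_000)

# Evaluate both empirical losses on a grid of candidate betas.
grid = np.linspace(y.min(), y.max(), 10_001)
sq_loss = ((y[:, None] - grid) ** 2).sum(axis=0)
abs_loss = np.abs(y[:, None] - grid).sum(axis=0)

print(grid[sq_loss.argmin()], y.mean())        # L2 minimizer ~ sample mean
print(grid[abs_loss.argmin()], np.median(y))   # L1 minimizer ~ sample median
```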

I have read that the conditional mode comes into play for a uniform cost function, i.e., $C(y, \beta) = 1$ for $|y-\beta|>\epsilon$ and $0$ otherwise, as $\epsilon\to 0$. Repeating the derivative step above, I get $$\lim_{\epsilon \to 0}\sum_i \left[-\delta\big(\beta - (y_i-\epsilon)\big) + \delta\big(\beta - (y_i+\epsilon)\big)\right],$$ where $\delta$ is the Dirac delta.

  1. How do we get to the conditional mode from the last step?
  2. The MAP estimate is also linked to the uniform cost function, but what is the precise relationship between MAP estimate and the above derivation?
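
For intuition on question 1, a small sketch of my own (assuming NumPy, and not a formal argument): minimizing $\sum_i \mathrm{1}_{|y_i-\beta|>\epsilon}$ is the same as maximizing the number of observations within $\epsilon$ of $\beta$, and as $\epsilon$ shrinks (with enough data) this concentrates around the mode of the underlying density.

```python
import numpy as np

rng = np.random.default_rng(1)
# Gamma(shape=3, scale=1) has its density mode at (3 - 1) * 1 = 2.
y = rng.gamma(shape=3.0, scale=1.0, size=50_000)

grid = np.linspace(0.0, 8.0, 1_001)
for eps in (1.0, 0.3, 0.1):
    # Maximizing the count inside the eps-neighbourhood of beta is equivalent
    # to minimizing the uniform cost sum_i 1{|y_i - beta| > eps}.
    counts = np.array([np.sum(np.abs(y - b) <= eps) for b in grid])
    print(eps, grid[counts.argmax()])   # drifts toward the true mode, 2.0
```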
Bravo
  • The MAP (maximum a posteriori) estimator is the mode of the posterior distribution. – Henry Nov 26 '21 at 16:15
  • Your mean and median expressions do not seem to take account of the posterior distribution. If they did, by minimising the expected loss, then you would get the conditional mean and conditional median. The third expected loss expression would then be $1$ minus the probability of being in a particular interval/neighbourhood of length/diameter $2\epsilon$ which (assuming $\epsilon$ small enough and some smoothness in the posterior distribution) would be minimised by an interval/neighbourhood of highest probability containing the conditional mode, so converging on the mode as $\epsilon$ reduces – Henry Nov 26 '21 at 16:27
  • This is interesting. Can you tell us where exactly you found this claim? It seems to contradict [Heinrich (2014, *Biometrika*): "The mode functional is not elicitable"](https://www.jstor.org/stable/43305608). – Stephan Kolassa Nov 26 '21 at 17:33
  • The differentiations in the $L_1$ and modal case are invalid because the functions involved are not everywhere differentiable: this fact further requires you to examine possible solutions where the derivative is undefined. – whuber Nov 26 '21 at 17:36
  • @StephanKolassa: I recalled the results from "An Introduction to Signal Detection and Estimation" by Vincent Poor, but they were derived differently I think. – Bravo Nov 26 '21 at 17:37

1 Answer


I found the answer in the paper *Is the mode elicitable relative to unimodal distributions?*

If we consider the cost function $C(x,y)=\mathrm{1}_{x\ne y}$, the minimization corresponding to the mode is $\hat\beta = \arg\min_{\beta} \sum_i \mathrm{1}_{y_i\ne \beta}$. The minimum is attained at the most frequent value, i.e., $\hat\beta = \text{mode}\{y_i\}_{i=1}^n$. The paper also discusses the mode not being "elicitable" with respect to Lebesgue densities.
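
A tiny finite-sample illustration of this (my own sketch, using only the standard library): with the 0-1 cost, the $\beta$ minimizing the number of mismatches is simply the most frequent value in the sample.

```python
from collections import Counter

y = [2, 3, 3, 5, 3, 7, 2]

# Minimize the 0-1 cost sum_i 1{y_i != beta} over the observed values.
beta_hat = min(set(y), key=lambda b: sum(yi != b for yi in y))

print(beta_hat)                          # 3
print(Counter(y).most_common(1)[0][0])   # 3, the sample mode
```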

The above derivations can also be carried out inside the conditional expectation $\mathbb{E}[\,\cdot\mid X=x\,]$, which yields the conditional mean, median, and mode of the posterior distribution as the solutions of the regression under the corresponding cost functions.
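
To illustrate that last point (again my own sketch, assuming NumPy and a discrete regressor so that all three conditional functionals are easy to read off): carrying out the same minimizations within each value of $X$ gives the conditional mean, median, and mode.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
x = rng.integers(0, 3, size=3_000)                       # discrete regressor
y = np.round(x + rng.normal(scale=1.0, size=3_000), 1)   # rounded, so a mode exists

for xv in np.unique(x):
    ys = y[x == xv]
    cond_mean = ys.mean()
    cond_median = np.median(ys)
    cond_mode = Counter(ys.tolist()).most_common(1)[0][0]
    print(xv, round(cond_mean, 2), cond_median, cond_mode)
```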

Bravo
  • I don't think I quite understand. Your calculations will give the mode of a given finite dataset, yes. (Although by a rather roundabout logic.) And it will work on any *counting* density (p. 2 middle of the PDF). But that does not carry over to continuous densities, which one would assume your original question in the context of *linear regression* presupposes. Also, I don't quite understand your last paragraph, and finally, on skimming the paper, I don't find anything that sounds like what you are writing here. Would you perhaps elaborate a bit? – Stephan Kolassa Nov 29 '21 at 14:53