What is the loss function in GLMs? We only deal with the conditional mean of the response given the input, $E[Y|X]$, so I assume that underneath we are using $L_2$ loss. Is that correct? What about other loss functions (e.g. $L_1$ loss or other surrogate losses common in ML)? Please point me to relevant documents if I'm just very confused.

1 Answer


GLMs are fit via maximum likelihood, so if you want to view fitting as a minimization problem, the loss is the negative log-likelihood.

For some likelihoods this is equivalent to minimizing an $L_p$ norm (a Gaussian linear model recovers the $L_2$ loss of ordinary least squares), but it doesn't have to be. A standard example is logistic regression, where the log-likelihood for $n$ observations is $$ \ell(\beta\mid y, x) = \sum_{i=1}^n \left[ y_i \log g^{-1}(x_i^T\beta) + (1-y_i)\log\bigl(1 - g^{-1}(x_i^T\beta)\bigr) \right] $$ where $g = \text{logit}$ is the link function, so $g^{-1}$ is the logistic (sigmoid) function.
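To make the "MLE as loss minimization" view concrete, here is a minimal sketch (not from the answer itself; the simulated data and the choice of BFGS are illustrative assumptions) that fits logistic regression by directly minimizing the negative log-likelihood above:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate a small logistic regression problem (illustrative only).
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
prob = 1 / (1 + np.exp(-X @ beta_true))   # g^{-1}(x_i^T beta) for the logit link
y = rng.binomial(1, prob)

def neg_log_lik(beta):
    eta = X @ beta                         # linear predictor x_i^T beta
    # Negative of the log-likelihood, written in the numerically
    # stable form sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ].
    return -(y @ eta - np.logaddexp(0, eta).sum())

fit = minimize(neg_log_lik, x0=np.zeros(p), method="BFGS")
print(fit.x)  # should be close to beta_true
```

GLM software typically minimizes this same objective via IRLS rather than a generic optimizer, so the fitted coefficients should agree up to numerical tolerance.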

jld
  • Maybe I can elaborate a bit more on my question. In general we are trying to minimize $E[L(\delta(X),Y)]$. Assuming the loss function $L$ is $L_2$, the optimal (Bayes) estimator is $\delta(X) = E[Y|X] = g^{-1}(X^T\beta)$, right? If instead of $L_2$ we had $L_1$, the answer would be $\operatorname{median}(Y|X)$. In that case, would the answer still be equal to $g^{-1}(X^T\beta)$? What if I use a different loss? (See the first numerical sketch after these comments.) – Ahmad Khaled Mar 09 '19 at 02:50
  • @AhmadKhaled in practice the least-squares solution will not be quite the same as the maximum likelihood solution, since the logs in the log-likelihood punish being confidently wrong much more harshly. E.g. for a single point, with least squares the worst-case loss is bounded above by $1$ (predicting $0$ when $y_i = 1$ or vice versa), whereas for the MLE the worst case for one point is unbounded, since $1 \cdot \log 0 = -\infty$ (see the second sketch after these comments). I've written about this a bit [here](https://stats.stackexchange.com/questions/326350/). – jld Mar 12 '19 at 16:10
  • I think I have a better picture now. Using other loss functions doesn't change the underlying problem of estimating the best parameter $\beta$ in $g^{-1}(X^T\beta)$. With $L_1$ loss, solving $\operatorname{median}(Y|X)=g^{-1}(X^T\beta)$ doesn't change anything, as the problem is still to find the best $\beta$ (because we assumed a model linear in the parameters anyway). In most cases anything other than the mean is useless: for example, for a Bernoulli distribution the median is either $0$ or $1$, and solving $g^{-1}(X^T\beta)=0$ or $g^{-1}(X^T\beta)=1$ won't give us a useful $\beta$. – Ahmad Khaled Mar 27 '19 at 02:24
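A hedged numerical check of the Bayes-estimator claim in the first comment above (the skewed distribution and the grid search are illustrative assumptions, not part of the thread): among constant predictions, the minimizer of expected $L_2$ loss is the mean, while the minimizer of expected $L_1$ loss is the median.

```python
import numpy as np

# Skewed Y so that the mean and median differ noticeably.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=200_000)

grid = np.linspace(0.01, 6.0, 600)   # candidate constant predictions c
l2_risk = np.array([np.mean((y - c) ** 2) for c in grid])
l1_risk = np.array([np.mean(np.abs(y - c)) for c in grid])

print(grid[l2_risk.argmin()], y.mean())      # L2 minimizer ~ mean (= 2)
print(grid[l1_risk.argmin()], np.median(y))  # L1 minimizer ~ median (= 2 ln 2)
```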
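And a tiny illustration of the boundedness point in the second comment (the probability values here are arbitrary):

```python
import numpy as np

# Single observation with y_i = 1: squared error stays below 1,
# while the log-loss term -log(p_i) diverges as the predicted p_i -> 0.
p = np.array([0.5, 0.1, 1e-3, 1e-9])   # predicted P(y_i = 1) when y_i = 1
print((1 - p) ** 2)                    # squared error: bounded above by 1
print(-np.log(p))                      # log loss: grows without bound
```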