4

It seems to be the most popular loss function for regression, used for everything from OLS (it's in the name!) to sophisticated regularized regressions.

Why is it so popular and what are the drawbacks?

Sextus Empiricus
badmax
  • This must be a duplicate, mustn't it? It would be very surprising if this question had not been asked before. – Richard Hardy Oct 31 '17 at 19:04
  • The first sentence in the question is a bit strange; it reads like an incomplete sentence. – Sextus Empiricus Oct 31 '17 at 20:52
  • Richard, in fact the 118th question on this website touches on this topic, and the answer by Jen contains very useful links to the work of Gorard, which mostly answers this question: https://stats.stackexchange.com/questions/118/ Possibly the OP had a more sophisticated discussion in mind, but that would be very broad, and some direction from the OP would be nice. For instance, I cannot tell whether the OP's focus is on regularization or on norms/errors. – Sextus Empiricus Oct 31 '17 at 21:09

2 Answers

4

Ideally, the loss function should reflect the losses that forecast errors cause you. In this ideal setup there are no advantages or disadvantages among loss functions, so long as they represent your losses appropriately.

For instance, if any over- or under-prediction by $\Delta y$ units of items sold leads to $\$110\times (\Delta y)^2$ in losses, then there is no disadvantage to the $L(\Delta y)=(\Delta y)^2$ loss function. It is simply the reality, whether you like it or not. Using any other loss function would simply be wrong, not advantageous or disadvantageous.
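(To make the link between loss and forecast concrete: under squared loss the optimal point forecast is the mean, since $E\left[(Y-\hat y)^2\right]=\operatorname{Var}(Y)+\left(E[Y]-\hat y\right)^2$, which is minimized at $\hat y=E[Y]$; under absolute loss the minimizer is the median instead.)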

Unfortunately, almost nobody even tries to construct the true loss function these days. There may be many reasons why we no longer do so, but in practice we choose loss functions based on convenience. This is what gives them advantages and disadvantages relative to each other.

So, for instance, if your error distribution is Cauchy, then the least squares loss function will lead you nowhere, since it is linked to the expectation (first moment), which does not exist for the Cauchy distribution. On the other hand, least absolute deviations will produce a solution for Cauchy, since it is linked to the median, which does exist for this distribution. In this regard least squares is less robust than least absolute deviations. On a related note, least squares models are sensitive to outliers.
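A minimal simulation sketch of the Cauchy point (assuming Python with NumPy; the sample sizes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard Cauchy: the mean (and all moments) do not exist, but the median is 0.
for n in (100, 10_000, 1_000_000):
    x = rng.standard_cauchy(n)
    print(f"n={n:>9}: sample mean = {x.mean():10.2f}, "
          f"sample median = {np.median(x):6.3f}")

# Typical behavior: the sample median settles near 0 as n grows, while
# the sample mean keeps jumping around and never converges.
```

Least absolute deviations targets the median, which is well defined here; least squares targets a mean that does not exist.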

Aksakal
  • *Ideally the loss function should reflect the losses that are caused to you by the forecast errors.* Do you mean the loss function used for fitting the model? If so, you are incorrect. Counterexample: suppose the true distribution is $N(\mu,\sigma^2)$, the real-world loss is absolute loss, and the quantity of interest is a point forecast of a new observation. According to you, we should estimate $\mu$ by the sample median which corresponds to absolute loss, and then take it as the point forecast. ... – Richard Hardy Nov 02 '17 at 18:15
  • ...I say you get better forecast accuracy if you use the maximum likelihood estimator for $\mu$, which in this case is the sample mean and corresponds to square loss (thus a mismatch!), and take it as the point forecast. This is simply because the maximum likelihood estimator for $\mu$ is more efficient than the sample median. In any case, your answer does not seem to address the OP's question... – Richard Hardy Nov 02 '17 at 18:16
  • @RichardHardy, you're absolutely right about the median; glad that you got it without me writing it explicitly. If your business depends on the estimation of $\mu$, and the cost of a forecast error is linear in the error, then the median is the way to go. – Aksakal Nov 02 '17 at 18:30
  • What exactly are you saying there? Saying that the true median is the optimal point forecast under absolute loss is correct. Saying that you should estimate it by the sample median when the distribution is $N(\mu,\sigma^2)$ is wrong, because a more efficient estimator, the sample mean, is easily available, as I said above. – Richard Hardy Nov 02 '17 at 18:37
  • @RichardHardy What's more efficient? Imagine that the loss is very asymmetric but the distribution is symmetric. In this case the mean estimate will not be optimal in terms of loss. – Aksakal Nov 02 '17 at 18:39
  • OK, so you are starting a new example, right? Because you would be wrong if you were talking about my example. – Richard Hardy Nov 02 '17 at 18:56
  • @RichardHardy, no, your example is good too. For instance, suppose you are estimating an optimal dosage of meth, and the error is Gaussian as in your case. If you underdose, the customer is less satisfied; if you overdose, the customer dies. Would you really go with the mean estimate of $\mu$? I'd go with a biased recommendation, biased below the mean estimate of $\mu$. It's less trouble to deal with a dissatisfied customer than with the friends of a dead one. – Aksakal Nov 02 '17 at 19:09
  • This is not my example, because in my example the real-world loss is absolute. Now, in your example, I think your explanation misses the difference between model estimation and forecasting from the model. In estimation you want a precise estimate, and you use the loss function that delivers it. Once you have it, you tailor the forecast to the real-world loss function. Your example only illustrates that the point forecast needs to be tailored to the real-world loss function. – Richard Hardy Nov 02 '17 at 19:33
  • @Richard Hardy: This is an interesting discussion, because the decision theory texts I have seen tend to focus (in the theory part) on building from real-life, custom-built loss functions, but then in the examples tend to use some standard ones... And I have never seen a decision theory text mention the topic of proper score functions, which indeed seems relevant for decisions... – kjetil b halvorsen Oct 10 '18 at 12:08
  • @kjetilbhalvorsen, I agree. I have pondered estimation vs. use of the model a lot in recent years (and posted a few related questions here on CV) and am still trying to form a coherent view of it. Still learning, hopefully progressing in the right direction... – Richard Hardy Oct 10 '18 at 13:05
  • @Richard Hardy: Yes. And I have still not seen a regression/modeling book with a chapter about *use* of the estimated model! – kjetil b halvorsen Oct 10 '18 at 13:27
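A quick simulation sketch of the mean-versus-median efficiency point discussed in the comments above (hypothetical setup: Gaussian data, absolute real-world loss; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 25, 200_000

loss_mean = loss_median = 0.0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)   # training data
    new = rng.normal(mu, sigma)         # future observation to forecast
    loss_mean += abs(new - sample.mean())
    loss_median += abs(new - np.median(sample))

print("avg |error| with forecast = sample mean:  ", loss_mean / reps)
print("avg |error| with forecast = sample median:", loss_median / reps)
```

Under Gaussian data the mean and the median coincide, and the sample mean estimates this common value more efficiently than the sample median, so it yields a (slightly) smaller expected absolute forecast error even though the evaluation loss is absolute.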
3

The most popular alternative to minimizing the squared distance (L2 loss) between predictions and targets is minimizing the absolute distance (L1 loss).

The first big difference is that L2 loss puts much more weight on outliers, because squaring makes large errors count disproportionately more than small ones.

The second big difference is the implied distribution around the trend: minimizing L2 loss corresponds to maximum likelihood under Gaussian residuals, while minimizing L1 loss corresponds to maximum likelihood under Laplacian residuals. See this discussion for more details.
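A small sketch illustrating the outlier point (assuming Python with NumPy and SciPy; the data, the true coefficients, and the outlier are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, x.size)  # true slope 2, intercept 1
y[-1] += 50.0                                     # one gross outlier

def fit(loss):
    """Fit y ~ a*x + b by minimizing the given loss over the residuals."""
    objective = lambda p: loss(y - (p[0] * x + p[1])).sum()
    return minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead").x

a2, b2 = fit(np.square)  # L2 loss (least squares)
a1, b1 = fit(np.abs)     # L1 loss (least absolute deviations)
print(f"L2 fit: slope = {a2:.2f}, intercept = {b2:.2f}")  # pulled by the outlier
print(f"L1 fit: slope = {a1:.2f}, intercept = {b1:.2f}")  # close to (2, 1)
```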

user2089357
  • Neither of those loss functions makes distributional assumptions, nor do they imply any such assumptions unless you assume they are associated with likelihoods. These points therefore could benefit from clarification or further elaboration. – whuber Oct 31 '17 at 16:42
  • Linear least squares regression problems (even those with elaborate basis expansions and interaction terms) can be solved efficiently in closed form; iterative solutions are unnecessary. This is also the case for least squares solutions with quadratic penalties on the coefficients (such as ridge regression or the "wiggliness" penalty in MGCV). This is a huge computational/practical advantage. – Josh Oct 31 '17 at 17:43
  • Also, the link in this answer is a discussion about ridge regression, or L2 regularization. It's not immediately obvious what that has to do with the loss function. In principle, parameters could be fit such that they minimize L1 error while being subject to an L2 penalty. The reverse of this (L2 loss, L1 penalty) is the LASSO. – Josh Oct 31 '17 at 17:51
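To illustrate the closed-form point from the comments above, a minimal sketch (assuming NumPy; X is a design matrix, y the targets, and lam the ridge penalty, all supplied by the caller):

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares: solve the normal equations X'X b = X'y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # Ridge regression: the same normal equations with lam*I added, still closed form.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# L1 loss, or an L1 (lasso) penalty, has no comparable closed form and
# requires iterative solvers (e.g. linear programming or coordinate descent).
```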