
Why are the "Loss Functions" being Optimized in most Statistical/Machine Learning Problems usually "Quadratic"?

Using very basic logic, in statistics/machine learning we are trying to minimize the "error" between a model and some "hypothetical ideal function" that perfectly models the data. This "error" is typically described as the "Mean SQUARED Error" (MSE), and the function characterizing this error is what we are trying to minimize. "Squared" is the defining characteristic of a "quadratic function" (i.e. the highest power term in a quadratic function is power 2, i.e. squared). I have heard that the reason loss functions are typically quadratic is that quadratic functions have certain "attractive and desirable theoretical properties" that facilitate convergence for tasks such as root finding and minimization, but I am not sure about this or what these properties are.
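For concreteness, here is a minimal sketch of what I mean by MSE (the arrays are just made-up numbers for illustration, not from any real dataset):

```python
import numpy as np

# Hypothetical observed targets and model predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.3, 3.6])

# Mean squared error: the average of the squared residuals.
# Each residual enters as a power-2 (squared) term, which is why
# the loss is "quadratic" in the residuals.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.0675
```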


Thus - no matter how many parameters are in the loss function being optimized (e.g. neural networks with many weights/layers), and no matter how complex the behavior of the loss function being optimized: is it safe to assume that the loss functions being optimized in most statistical/machine learning problems are usually "quadratic"? And does anyone know why this is?

Can someone please comment on this?

Thanks!

Note: I have heard similar arguments made in reference to the "non-convexity" of loss functions being optimized in most statistical/machine learning problems. Although there are standard mathematical definitions for determining whether a function is convex or non-convex (e.g. https://en.wikipedia.org/wiki/Convex_function, "Definitions"), high-dimensional loss functions involving "random variables" (as opposed to the non-random, deterministic functions of classical analysis, i.e. "noisy" functions) are almost always said to be non-convex. This has always made me wonder how, even though optimization algorithms like Gradient Descent were designed for convex and non-noisy functions, they are still somehow able to display remarkable success when optimizing non-convex and noisy functions.

stats_noob
  • A bit related: ["Why is using squared error the standard when absolute error is more relevant to most problems?"](https://stats.stackexchange.com/questions/470626/why-is-using-squared-error-the-standard-when-absolute-error-is-more-relevant-to/470786#470786). – Richard Hardy Jan 21 '22 at 19:34
  • https://stats.stackexchange.com/q/132622/17230 also – Scortchi - Reinstate Monica Jan 21 '22 at 19:35
  • @Scortchi-ReinstateMonica Good find! That Q seems like a dupe to the titular question; on the other hand, doesn't address whether all loss functions are quadratic – Sycorax Jan 21 '22 at 19:36
  • The reason it doesn't address that is because it's not the case that all loss functions are quadratic. Perhaps the most notable and important ones appear in the theory of robust statistics. https://stats.stackexchange.com/questions/251600 discusses another common non-quadratic loss. It's also important to understand that "loss functions" play several different roles in statistical analysis. – whuber Jan 21 '22 at 19:56
  • Thank you everyone! Just to clarify - even for neural networks with thousands of weight parameters... the loss function is usually still quadratic? Is it safe to assume that in these cases, the loss function is non-convex? – stats_noob Jan 21 '22 at 20:41
  • Depends on what you mean. From the perspective of the network weights, neural networks are not strongly convex in general (but there are special cases, like logistic regression or OLS regression which are convex, possibly even strongly convex); this is because NN weights/biases can be reflected or permuted in specific ways and achieve the same loss value. From the perspective of the predicted values alone, $(\hat{y} - y)^2$ is a quadratic with positive leading coefficient, therefore it is convex. – Sycorax Jan 21 '22 at 21:11
  • @Sycorax: thank you for your reply! Can you please go over it again? If the function is purely in y: convex. If y itself is written in terms of the weights: non-convex. Why is this distinction important? – stats_noob Jan 22 '22 at 01:46

2 Answers


No, it is not safe to assume that all loss functions are quadratics. One of the most common cost functions is the binomial cross-entropy $$ L(p) = -\left[\,y\log p + (1-y)\log(1-p)\,\right] $$ where $0 < p < 1$ and $y \in \{0,1\}$. The function $L$ is not a quadratic because it is not a polynomial of degree 2 in $p$.
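As a quick illustration (a sketch with made-up numbers, not tied to any particular library), evaluating both losses on a grid of predicted probabilities shows that the cross-entropy curve is not a parabola:

```python
import numpy as np

# Hypothetical true label and a grid of predicted probabilities
y = 1.0
p = np.linspace(0.01, 0.99, 5)

# Squared error: a degree-2 polynomial in p
squared = (y - p) ** 2

# Binomial cross-entropy: involves logarithms, so it is not a polynomial at all
cross_entropy = -(y * np.log(p) + (1 - y) * np.log(1 - p))

for pi, se, ce in zip(p, squared, cross_entropy):
    print(f"p={pi:.2f}  squared={se:.3f}  cross-entropy={ce:.3f}")
```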

Sycorax
  • Asymptotically, though, this *is* quadratic, and afaik almost all results based on this loss are asymptotic ones. In fact, just about every appearance of the chi-squared distribution in statistics owes its justification to such a phenomenon. – whuber Jan 21 '22 at 19:53

Five reasons come to mind quickly.

  1. Square loss brutally punishes bad misses. If you miss by $1$, your square loss is $1$, but if you miss by $2$, your square loss is $4$. This helps keep a model from making gigantic errors.

  2. Square loss is related to the variance of an error term, if you’re willing to assume that variance to be constant.

  3. Minimizing square loss seeks out the conditional expected value (see the small sketch after this list).

  4. Minimizing square loss corresponds to maximum likelihood estimation if the conditional distribution is Gaussian. Maximum likelihood estimation is a technique with which statisticians are quite comfortable.

  5. Tradition! The first type of machine learning one learns is “trendline” in Excel. Then one learns how to extend that by learning simple and then multiple linear regression via ordinary least squares, which minimizes square loss.
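To illustrate point 3, here is a minimal sketch with simulated (made-up) data: the constant prediction that minimizes square loss lands on the sample mean, while minimizing absolute loss instead lands on the median.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=10_000)  # hypothetical skewed data

# Evaluate both losses over a grid of candidate constant predictions.
candidates = np.linspace(0.0, 6.0, 601)
square_loss = [np.mean((y - c) ** 2) for c in candidates]
absolute_loss = [np.mean(np.abs(y - c)) for c in candidates]

best_square = candidates[np.argmin(square_loss)]
best_absolute = candidates[np.argmin(absolute_loss)]

print(f"square-loss minimizer   ~ {best_square:.2f}, mean   = {y.mean():.2f}")
print(f"absolute-loss minimizer ~ {best_absolute:.2f}, median = {np.median(y):.2f}")
```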

That being said, many other loss functions are popular. "Crossentropy" is popular for classification problems, for instance. Neural-network libraries like Keras list many popular loss functions that are built in.
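For instance, a short sketch (assuming TensorFlow/Keras is installed; the toy tensors are made up) evaluating a quadratic loss alongside two non-quadratic built-ins:

```python
import tensorflow as tf

y_true = tf.constant([0.0, 1.0, 1.0, 0.0])
y_pred = tf.constant([0.1, 0.8, 0.6, 0.3])

# A quadratic loss and two popular non-quadratic alternatives,
# all built into Keras.
mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy()
huber = tf.keras.losses.Huber()

print("MSE:  ", float(mse(y_true, y_pred)))
print("BCE:  ", float(bce(y_true, y_pred)))
print("Huber:", float(huber(y_true, y_pred)))
```

Nothing in the library forces the quadratic choice; swapping the `loss` argument in `model.compile` changes what is optimized.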

Dave
  • @Dave: thank you so much for your answer! Just to clarify - even for neural networks with thousands of weight parameters... the loss function is usually still quadratic? Is it safe to assume that in these cases, the loss function is non-convex? – stats_noob Jan 22 '22 at 01:20
  • @stats555 The loss might be quadratic but might not. Nothing forces you to use square loss. – Dave Jan 22 '22 at 02:01