
I have some nonlinear, non-normal data that I am trying to analyze. The data has been normalized to [-1, 1] and detrended with third-order polynomials. I'm trying to determine whether there is a special loss function for nonlinear regression-type problems. I did some googling, but not much showed up. Right now I'm sticking with MSE.

Richard Hardy
intuition
  • The best loss function is the one that best represents the real-world loss incurred from inaccurate predictions in production for a specific application. – Bernhard Barker Aug 24 '21 at 09:35
  • @BernhardBarker, generally a good idea. Sometimes it can be improved upon; training loss does not have to match evaluation loss if we want to achieve the best out-of-sample results; see Dave's and my comments under Stephan Kolassa's answer. – Richard Hardy Aug 24 '21 at 10:42

3 Answers


There's no such thing as a loss function "for" a particular kind of model. You can use nonlinear regression with many different loss functions, and you can even construct one yourself. The choice depends on the nature of your problem and the data you are dealing with. Recall that minimizing a loss is often equivalent to maximizing a likelihood (e.g. minimizing squared error is equivalent to maximum likelihood under a Gaussian likelihood), so the choice of loss is tightly connected to the assumptions you are making about the distribution of the errors.
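To make that equivalence concrete for the squared-error case (a standard derivation, not something specific to this thread): if the errors in $y_i = f_\theta(x_i) + \varepsilon_i$ are i.i.d. Gaussian with fixed variance $\sigma^2$, then

$$ \arg\max_\theta \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right) = \arg\min_\theta \sum_i \big(y_i - f_\theta(x_i)\big)^2, $$

because taking logs turns the product into a sum and discards the terms that do not depend on $\theta$. The analogous calculation with a Laplace likelihood recovers the absolute error.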

More formally, if you think of the model as something like

$$ y = f(X) + \varepsilon $$

then the choice of model (e.g. linear regression, nonlinear regression, deep neural network, etc.) is related to estimating the conditional expectation $E[y \mid X] = f(X)$, while the choice of the loss function affects how you treat the residuals $y - f(X) = \varepsilon$.

For example, squared error penalizes outliers more heavily than absolute error, so it is preferable if that is the behavior you want. Absolute error, on the other hand, is less sensitive to outliers, which can be an advantage in other scenarios.

The most common choice is to default to squared error, though that is a somewhat arbitrary choice and need not be the best one in all cases.
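As a minimal sketch of the outlier point (the exponential model, the injected outlier, and all parameter values below are made up purely for illustration), you can fit the same nonlinear model under both losses with scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy nonlinear model y = a * exp(b * x) + noise, with one injected outlier.
x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * x) + rng.normal(0, 0.2, x.size)
y[25] += 10.0  # a single gross outlier

def model(params):
    a, b = params
    return a * np.exp(b * x)

def sse(params):  # squared error: targets the conditional mean
    return np.sum((y - model(params)) ** 2)

def sae(params):  # absolute error: targets the conditional median
    return np.sum(np.abs(y - model(params)))

# Nelder-Mead handles the non-differentiable absolute-error objective.
fit_sq = minimize(sse, x0=[1.0, 1.0], method="Nelder-Mead")
fit_ab = minimize(sae, x0=[1.0, 1.0], method="Nelder-Mead")

print("squared error fit: ", fit_sq.x)  # noticeably pulled by the outlier
print("absolute error fit:", fit_ab.x)  # stays much closer to (2.0, 1.5)
```

Comparing the two printed parameter vectors against the true (2.0, 1.5) makes the "penalizes outliers more" point tangible.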

Tim
  • the errors are not normal, data is nonlinear, nonnormal...i probably got the time lag wrong...i just know that regression doesn't work well. – intuition Aug 23 '21 at 17:19
  • 1
    @intuition data cannot be nonlinear, function can be linear or not. Moreover, linear regression can in many cases approximate well such cases. Normality of residuals has nothing to do with the nature of functional relationship. What can you assume about the residuals? – Tim Aug 23 '21 at 18:41
  • that they aren't noise is about the only thing i can assume – intuition Aug 23 '21 at 19:53
  • 1
    @intuition what do you mean by that? – Tim Aug 24 '21 at 05:03
  • 1
    *then the choice of model (e.g. linear regression, nonlinear regression, deep neural network, etc) is related to estimating the expectation*: does this apply e.g. to quantile regression? E.g. in your example of absolute loss, one obtains the conditional median. I guess this also relates to Stephan Kolassa's answer. – Richard Hardy Aug 24 '21 at 06:32
  • @RichardHardy yes, if you want to be precise, it's a [kind of simplification](https://stats.stackexchange.com/questions/173660/definition-and-delimitation-of-regression-model/211229#211229). – Tim Aug 24 '21 at 07:32

Other answers (like bdeonovic's and Tim's) discuss "robustness to outliers". I have to admit that while this point of view is extremely common, I do not like it very much.

I find it more helpful to think in terms of which conditional fit (or prediction) we want.

  • Use the squared errors if you want conditional expectations as fits or predictions. ("Outliers" are then simply observations that are "far away" from the expectation, and which therefore pull the expectation towards them. If your aim is an expectation fit/prediction, then you should think long and hard about whether you want "robustness to outliers", because "outliers" are a fact of life.)
  • Use the absolute errors if you want conditional medians as fits or predictions.
  • Use quantile (AKA pinball) losses if you want conditional quantiles as fits or predictions. (A small numerical sketch follows this list.)
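A quick numerical check of the last bullet (my own sketch with made-up lognormal data, not something from the answer or the paper): minimizing the pinball loss over a constant recovers the corresponding empirical quantile, and $\tau = 0.5$ reduces to the absolute-error/median case.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pinball(q, y, tau):
    """Pinball loss at level tau: tau*r for r >= 0, (tau - 1)*r for r < 0."""
    r = y - q
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

rng = np.random.default_rng(1)
y = rng.lognormal(size=10_000)  # skewed, so mean and median differ

for tau in (0.1, 0.5, 0.9):
    fit = minimize_scalar(lambda q: pinball(q, y, tau))
    print(f"tau={tau}: argmin={fit.x:.3f}  empirical quantile={np.quantile(y, tau):.3f}")
```

The two columns agree up to optimizer tolerance, which is exactly the "loss determines the functional" message of this answer.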

I have written a short paper (Kolassa, 2020, IJF) on this in the context of forecasting, but the idea holds in precisely the same way for fits.

Thus, I would recommend you think about what kind of fit/prediction you want, and then tailor your loss function to this.

Stephan Kolassa
  • I'm not convinced of this. For example, if $X$ has a Laplace distribution, the mean is a viable estimator of $\mathbb{E}[X]$, but the median is considered better. Applying a similar idea to regression, explicitly estimating conditional medians could be a better way of estimating conditional means than explicitly estimating conditional means, could it not? – Dave Aug 23 '21 at 16:39
  • @Dave: I would ask "*better* in what sense"? At Wikipedia, you would read "the median is considered better[*by whom?*]" And if your goal is to estimate the conditional median, then sure, use absolute errors. My point is that figuring out which functional of the unknown distribution we are interested in is the key question that determines which loss function to use. – Stephan Kolassa Aug 23 '21 at 17:15
  • It's the lower variance estimator and also unbiased. – Dave Aug 23 '21 at 17:17
  • Dave has a point there. Empirical counterparts of theoretical quantities of interest are not always the best estimators. *Better* in the sense that empirical median is the MLE of the target parameter, and MLE has certain optimality properties (so it can be considered *best* in a well defined sense). – Richard Hardy Aug 24 '21 at 06:39
  • Off topic here but: might you be interested in ["Assess calibration of a density forecast by Kolmogorov-Smirnov test on PIT of realized values"](https://stats.stackexchange.com/questions/529963) (with a bounty)? – Richard Hardy Aug 24 '21 at 13:13

Most of the alternative loss functions are for making the regression more robust to outliers. I've seen all of the following in various software packages, though I haven't looked too hard into the literature comparing them (several are demonstrated in the SciPy sketch at the end of this answer):

  1. least absolute deviation
  2. least median of squares
  3. least trimmed squares
  4. metric trimming
  5. metric winsorizing
  6. Huber loss
  7. Tukey's biweight loss
  8. soft L1 loss
  9. Cauchy loss
  10. arctan loss

How are you doing the optimization? Did you code it yourself? Are you using Gauss–Newton or gradient descent? You may want to consider Levenberg–Marquardt, which interpolates between Gauss–Newton and gradient descent.
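Several losses from the list above (Huber, soft L1, Cauchy, arctan) are built into SciPy's least_squares, and the same function exposes Levenberg–Marquardt via method='lm' (note that 'lm' supports only the plain squared loss, so the robust losses run under the default trust-region method). A minimal sketch, with a made-up exponential model and noise/outlier levels chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2.0 * np.exp(1.5 * x) + rng.normal(0, 0.1, x.size)
y[::20] += 5.0  # sprinkle in a few gross outliers

def residuals(params):
    a, b = params
    return a * np.exp(b * x) - y

# loss can be 'linear' (plain least squares), 'huber', 'soft_l1',
# 'cauchy', or 'arctan'; f_scale sets the residual scale beyond which
# the robust loss starts to flatten out.
fit_plain = least_squares(residuals, x0=[1.0, 1.0])
fit_robust = least_squares(residuals, x0=[1.0, 1.0], loss="soft_l1", f_scale=0.3)

print("plain least squares:", fit_plain.x)  # dragged by the outliers
print("soft L1:           ", fit_robust.x)  # close to the true (2.0, 1.5)

# Levenberg-Marquardt is available too, but only with the default squared loss:
fit_lm = least_squares(residuals, x0=[1.0, 1.0], method="lm")
print("LM (squared loss): ", fit_lm.x)
```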

bdeonovic