
To my knowledge, R-squared should not be used in a non-linear regression setting. Not only might the $R^2$ be too high, but its interpretation as the proportion of variance explained by the model may no longer be valid.

Yet the mean squared error seems to be very similar to $R^2$, and it is said everywhere that the mean squared error can be used for non-linear regression models. So my question is: is MSE valid, or should it also be considered invalid for non-linear regression?

janrth
  • I'd prefer to define R-squared as the square of the correlation between observed and predicted response. Other definitions give the same result in some but not all cases. It is sometimes useful and sometimes not, but graphs and other numerical results should always be looked at any way. I would not say that it should not be used, as dogma. – Nick Cox Dec 21 '21 at 13:42
  • But mean squared error is only indirectly comparable as it has quite different units. I'd take a root to get back to the original scale, but even then one RMSE can be hard to assess, unless compared with others. Yet a benefit of RMSE should be that a subject-matter expert (e.g. scientist, engineer) should have a feeling for what kind of misfit is expected. – Nick Cox Dec 21 '21 at 13:42
  • 1
    This question comes from our discussion in the comments [here](https://stats.stackexchange.com/a/557389/247274), though this is not asking quite the same question we were discussing. (This question seems to concede my point that minimizing $MSE$ and maximizing $R^2$ are equivalent.) – Dave Dec 21 '21 at 13:44

2 Answers

2

There are two common reasons why square loss (such as $MSE$) is popular.

  1. Large misses are brutally punished. If your residual is $1$, your squared residual is $1$, but if your residual is $2$, your squared residual is $4$, and if your residual is $10$, your squared residual is $100$. By increasing the residual by a factor of ten, the square loss penalty is increased by a factor of $100$.

  2. If we assume Gaussian error terms, the least squares solution is equivalent to maximum likelihood of the regression parameters.

(If others comment about other reasons why square loss is popular, I can edit to include them, but these are the ones that come to mind quickly.)
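Point 2 can be checked with a small numerical sketch (the no-intercept model, data, and parameter grid here are made up for illustration): the Gaussian log-likelihood is a decreasing function of the sum of squared residuals, so the parameter that minimizes $MSE$ and the one that maximizes the likelihood coincide.

```python
import math
import random

# Hypothetical toy data: y = 2x + Gaussian noise (all values illustrative)
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]

def mse(beta):
    # mean squared error of a no-intercept model y-hat = beta * x
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gaussian_loglik(beta, sigma=0.5):
    # log-likelihood of the residuals under N(0, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - beta * x) ** 2 / (2 * sigma**2)
               for x, y in zip(xs, ys))

grid = [b / 100 for b in range(150, 251)]   # candidate slopes 1.50 .. 2.50
beta_mse = min(grid, key=mse)               # least-squares choice
beta_mle = max(grid, key=gaussian_loglik)   # maximum-likelihood choice
assert beta_mse == beta_mle  # the two criteria pick the same slope
```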

The way to compare which model does a better job of minimizing square loss is to look at which model minimizes such a value, so of course there is a sense in which $MSE$ is a legitimate performance metric.

A typical criticism of $R^2$ is that it can be driven arbitrarily high by overfitting, and this criticism is legitimate. However, measures of square loss like $MSE$ and $RMSE$ suffer from the same issue. If $R^2 = 1$ then $RMSE = 0$ and $MSE=0$.

A remedy for this is to do out-of-sample testing, such as the cross validation for which this Stack is named. While we might be able to drive $R^2$ up to $1$ by including features that are unrelated to the outcome and give a regression model that fits the noise instead of the signal (something like playing connect-the-dots with the scatterplot), out-of-sample performance will be poor when this is the case, hence the appeal of out-of-sample testing in machine learning.
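A minimal sketch of that phenomenon, with hypothetical data where the true signal is linear: a high-degree polynomial fit (still linear in its parameters, hence still least squares) can only drive in-sample $R^2$ up, while its out-of-sample $R^2$ is typically worse because it fit the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Hypothetical setup: the true signal is linear, so high-degree terms only chase noise
x_train = np.linspace(0, 1, 15)
y_train = x_train + rng.normal(0, 0.3, size=x_train.size)
x_test = np.linspace(0, 1, 200)          # fresh draw from the same process
y_test = x_test + rng.normal(0, 0.3, size=x_test.size)

results = {}
for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        r2(y_train, np.polyval(coefs, x_train)),  # in-sample R^2
        r2(y_test, np.polyval(coefs, x_test)),    # out-of-sample R^2
    )

# Nested least-squares fits: the degree-9 model can only improve in-sample R^2 ...
assert results[9][0] > results[1][0]
# ... but its out-of-sample R^2 is typically worse (it played connect-the-dots)
```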

There is an out-of-sample $R^2$:

$$R^2_{out} = 1 - \dfrac{n_{out} \cdot MSE_{out}}{\sum_{i = 1}^{n_{out}}\big(y_i - \bar y_{in}\big)^2}$$

where:

  • $n_{out}$: number of observations in the out-of-sample data
  • $MSE_{out}$: mean of the squared residuals for the out-of-sample predictions
  • $\bar y_{in}$: mean of the in-sample response variable

Notice that the subscripts in that equation indicate out-of-sample numbers, except for the mean $\bar y_{in}$.

To understand why, consider what in-sample $R^2$ measures: a comparison of the model under consideration to a model that naïvely guesses the pooled mean of $y$ every time in its attempt to model the conditional mean. It makes sense to consider such a model to be the baseline when we test out-of-sample. If we cannot do better than just guessing $\bar y_{in}$ every time, we have done a poor job of modeling the conditional mean.

The denominator term of $R^2_{out}$ is constant for a given test set, regardless of the model. Consequently, maximizing $R^2_{out}$ is equivalent to minimizing $MSE_{out}$.
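That equivalence can be sketched with made-up numbers (the test responses, in-sample mean, and two models' predictions below are purely illustrative):

```python
# Out-of-sample R^2 as defined above (all data here is illustrative)
def r2_out(y_out, preds, ybar_in):
    n_out = len(y_out)
    mse_out = sum((y - p) ** 2 for y, p in zip(y_out, preds)) / n_out
    denom = sum((y - ybar_in) ** 2 for y in y_out)  # fixed for a given test set
    return 1 - n_out * mse_out / denom

y_out = [1.0, 2.0, 3.0, 4.0]       # hypothetical test responses
ybar_in = 2.4                      # mean of the *in-sample* response
preds_a = [1.1, 2.2, 2.9, 3.8]     # model A: small residuals
preds_b = [0.5, 2.8, 2.1, 4.9]     # model B: larger residuals

# Lower MSE_out on the same test set <=> higher R^2_out
assert r2_out(y_out, preds_a, ybar_in) > r2_out(y_out, preds_b, ybar_in)
```

Because the denominator depends only on the test set and $\bar y_{in}$, ranking models by $R^2_{out}$ and ranking them by $MSE_{out}$ always agree.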

**Regarding the notebook** you linked in the comments, you made at least two mistakes.

  1. A quadratic relationship does not preclude linear modeling. Indeed, $y_i = \beta_0 +\beta_1x_i +\beta_2x_i^2 +\epsilon_i$ is a linear model and would give quite a good fit to your parabolic scatterplot. A nonlinear regression would be something like a neural network with $ReLU$ activation functions in order to fit the parabola, but you could do the same with the first plot.

  2. You are comparing models of different data. The critical part of my argument is that the denominator term in the $R^2$ equation, either in-sample or out-of-sample, is constant. If you change that value, then you can get situations where lower $MSE$ corresponds to lower $R^2$.
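To illustrate point 1 with a sketch (the parabola coefficients and noise level are made up): the model $y = \beta_0 + \beta_1 x + \beta_2 x^2$ can be fit by ordinary least squares precisely because it is linear in the *parameters*, even though it is curved in $x$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parabolic data, similar in spirit to the notebook's scatterplot
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.5, size=x.size)

# "Linear model" means linear in the parameters: design matrix with 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares

yhat = X @ beta
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
assert r2 > 0.99  # the linear-in-parameters model fits the parabola very well
```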

Dave
  • Thank you for that detailed answer. I am still studying it! In the meantime: what would your top 2 metrics be to validate a black-box ML model? – janrth Dec 21 '21 at 19:30
  • What do you mean by "validate"? [Note that a model can be "wrong" yet have some nice-looking performance.](https://stats.stackexchange.com/a/539121/247274) If the model in the link is good enough to make me a squillionaire, then there is a sense in which it is valid, even if it is incorrect. // [A question that might interest you.](https://stats.stackexchange.com/q/554331/247274) // [Another one might help explain why I am adamant about how I define $R^2_{out}$.](https://stats.stackexchange.com/a/530987/247274) – Dave Dec 21 '21 at 21:35
  • Then let me rephrase: what is the metric you use most when validating a regression model (taking into consideration all types of applications, from Kaggle up to a model in production)? – janrth Dec 22 '21 at 21:26
  • I like the idea of the out-of-sample R2. I definitely want to try that! But there is just one thing that still does not make sense in my head. Why is there research saying R-squared is not valid for non-linear regression (while this comes mostly from statisticians, who use it as an in-sample method to validate a fit)? Also, I tried to find more theoretical papers on R-squared in the pure ML context saying that you can actually use it in non-linear settings. I found this: https://peerj.com/articles/cs-623/, but it does not discuss the linear vs. non-linear case. I feel lost tbh :) – janrth Dec 22 '21 at 21:33
  • 1
    I would imagine that much of the criticism of $R^2$ comes from the fact that the in-sample value can be driven arbitrarily high by fitting to the noise, while an out-of-sample MSE value is penalized for fitting the noise. People have recognized this fact and are right to want to penalize fitting to the noise. However, in-sample MSE can be driven arbitrarily low. I wonder how much of the criticism of $R^2$ comes from (reasonably) disliking in-sample metrics and the fact that out-of-sample MSE is straightforward to calculate. – Dave Dec 23 '21 at 00:38
  • 1
    Regarding that PeerJ article, I am skeptical about any metric that claims to give a sense of absolute performance. Setting aside the issues with classification accuracy, it sounds pretty good to train a classifier that has $90\%$ accuracy, right? If you’re working on the MNIST digits, such performance is pedestrian. It’s not valid to say that a score of $0.9$ is awesome for every problem or that $0.5$ is awful for every problem. – Dave Dec 23 '21 at 00:43
  • Thank you so much for the discussion! I will spend some time over Christmas without a Laptop and then continue to think about this topic after Christmas again! – janrth Dec 23 '21 at 20:40
1

I quote two sentences from Applied Linear Statistical Models (Kutner) that are relevant to the question.

> p. 525: the error sum of squares SSE and the regression sum of squares SSR do not necessarily sum to the total sum of squares SSTO. Consequently, the coefficient of multiple determination $R^2 = SSR/SSTO$ is not a meaningful descriptive statistic for non-linear regression.

> p. 528: For nonlinear regression, MSE is not an unbiased estimator of $\sigma^2$, but the bias is small when the sample size is large.

So unlike linear regression, the inference procedure for nonlinear regression based on MSE is only approximate, but it is good enough for large samples.

user344849
  • 2
    How do you reconcile MSE being good enough in large samples and $R^2$ not being a meaningful descriptive statistic? Given the relationship, this seems like a contradiction. // Depending on what definition of MSE you use (for instance, $n$ denominator vs $n-1$ vs $n-p$), $MSE$ might not be an unbiased estimator of $\sigma^2_{\epsilon}$ in a *linear* regression. // If we assume a Gaussian error term, minimizing $MSE = \frac{\sum_{i=1}^n\big(y_i - \hat y_i\big)^2}{n}$ is equivalent to maximum likelihood estimation of regression parameters, even if the regression is nonlinear. – Dave Dec 21 '21 at 14:53