
We calculate $R^2$ as follows:

$R^2 = 1 - \frac{\|y - \hat y\|^2}{\|y - \bar y\|^2}$

  1. $y$ is a vector of true answers;
  2. $\bar y$ is a vector whose elements are all equal to the mean of $y$;
  3. $\hat y$ is a vector with our model's predictions.
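
(To make the notation concrete, here is a minimal numerical sketch of this definition; the vectors `y` and `y_hat` below are made up purely for illustration.)

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - ||y - y_hat||^2 / ||y - y_bar||^2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)      # ||y - y_hat||^2
    ss_tot = np.sum((y - y.mean()) ** 2)   # ||y - y_bar||^2
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 5.0])        # made-up "true answers"
y_hat = np.array([1.1, 1.9, 3.2, 4.7])    # made-up predictions
print(r_squared(y, y_hat))                # ~0.98
```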

So, in the case of OLS linear regression, $R^2 = 1 - \sin^2 \theta = \cos^2 \theta$, where $\theta$ is the angle between the vectors $y - \bar y$ and $\hat y - \bar y$.
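
(Here is a quick numerical check of this identity, as a sketch with simulated data and an ordinary least-squares fit; nothing about the specific numbers matters.)

```python
# Sketch: verify R^2 = cos^2(theta) for an OLS fit with an intercept,
# where theta is the angle between (y - y_bar) and (y_hat - y_bar).
# The data below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)              # OLS coefficients
y_hat = X @ beta

u = y - y.mean()          # y - y_bar
v = y_hat - y.mean()      # y_hat - y_bar
cos2_theta = (u @ v) ** 2 / ((u @ u) * (v @ v))
r2 = 1 - np.sum((y - y_hat) ** 2) / (u @ u)
print(cos2_theta, r2)     # agree up to floating-point error
```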

Everybody says that it's forbidden to use $R^2$ in case of non-linear models. I've been pondering why that should be so, and I still disagree. Here is my line of reasoning. Suppose we have some non-linear model; here is all you need to know about it:

[GIF: the vectors $y - \bar y$, $\hat y - \bar y$ and $y - \hat y$ for this non-linear model]

You can see from this GIF that $\hat y - \bar y$ is not orthogonal to $y - \hat y$. So $SS_{tot} \neq SS_{exp} + SS_{res}$, where $SS_{tot} = \|y - \bar y\|^2$, $SS_{exp} = \|\hat y - \bar y\|^2$ and $SS_{res} = \|y - \hat y\|^2$; the Pythagorean theorem gives that equality only when those two vectors are orthogonal. But why do we need that equality to hold? Look at how we calculate $R^2$: we never calculate $SS_{exp}$ explicitly. Instead we calculate it as the difference between $SS_{tot}$ and $SS_{res}$.
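
(A quick numerical check of this, as a sketch: I fit a simulated model that is non-linear in its parameters with `scipy.optimize.curve_fit`; the particular model and numbers are just an example.)

```python
# Sketch: check the decomposition SS_tot = SS_exp + SS_res for a model
# that is non-linear in its parameters (a simulated exponential fit).
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(0, 2, 60)
y = np.exp(1.3 * x) + rng.normal(scale=0.5, size=x.size)

f = lambda x, a, b: a * np.exp(b * x)        # non-linear in a, b
(a, b), _ = curve_fit(f, x, y, p0=(1.0, 1.0))
y_hat = f(x, a, b)

ss_tot = np.sum((y - y.mean()) ** 2)
ss_exp = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)
cross = (y_hat - y.mean()) @ (y - y_hat)     # the "other" (cross) term

print(ss_tot, ss_exp + ss_res)               # generally NOT equal here
print(ss_tot, ss_exp + ss_res + 2 * cross)   # equal up to rounding
```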

Let's look at what's happening when we calculate $R^2$ in the case of our non-linear model:

[GIF: construction of the green point $(\hat y)$ from $\|y - \bar y\|^2 - \|y - \hat y\|^2$]

When you calculate $\|y - \bar y\|^2 - \|y - \hat y\|^2$ in the numerator, by the Pythagorean theorem that is equivalent to calculating the squared length of the vector $(\hat y) - \bar y$. It just means that if your model were linear, its best-fit solution would lie where the green point $(\hat y)$ is. $R^2$ is still $\cos^2 \theta$, but now $\theta$ is no longer the angle between the vectors $y - \bar y$ and $\hat y - \bar y$; it is the angle between $y - \bar y$ and $(\hat y) - \bar y$.

There are actually infinitely many points $\hat y$ such that $\|y - \bar y\|^2 - \|y - \hat y\|^2$ is equal to $\|(\hat y) - \bar y\|^2$. This is true for every red point:

[GIF: the red points, all at the same distance from $y$, each giving the same value of $\|y - \bar y\|^2 - \|y - \hat y\|^2$]

From here we can't say that $\left((\hat y) - \bar y\right) + \left(y - \hat y\right) = \left(y - \bar y\right)$, but the variance of $y$ still decreased by exactly $\|(\hat y) - \bar y\|^2$, because $\|y - \bar y\|^2 - \|(\hat y) - \bar y\|^2 = \|y - (\hat y)\|^2 = \|y - \hat y\|^2$. And the closer the red point $\hat y$ is to $y$, the larger the value of $R^2 = \cos^2 \theta$.

This is just as meaningful as it is in the case of OLS linear regression. So, if everything I said is right, why can't we use $\mathbf{R^2}$ for non-linear models? If $R^2 = 0.86$, then the variance around your model's predictions is $86\%$ smaller than the total variance of $y$ (no matter whether the model is linear or not).

mathgeek
    "Everybody says that it's forbidden to use $R^2$ in case of non-linear models." This is just not true. There is a serious case that the square of the correlation between observed and predicted values is something that can generally be calculated. How useful it is, how closely related it is to anything else, and how far experiences with linear regression carry over to other cases are all detailed questions. – Nick Cox Oct 11 '21 at 16:07
  • Cox, D. R. and N. Wermuth. 1992. A comment on the coefficient of determination for binary responses. _American Statistician_ 46: 1–4. is a paper warning you to be careful. Zheng, B. and A. Agresti. 2000. Summarizing the predictive power of a generalized linear model. _Statistics in Medicine_ 19: 1771–1781. is positive about wider use, with cautions. – Nick Cox Oct 11 '21 at 16:12
  • @Nick Cox, Isn't it just as meaningful as in the case of OLS linear regression? In both cases we learn by how much your model's variance decreased w.r.t. its initial (total) variance. – mathgeek Oct 11 '21 at 16:16
  • For a given problem, where $y$ is fixed (and we are exploring different models), I find the algebraically equivalent value $\|y - \hat y\|^2$ (or its square root) to be more useful, because it directly expresses a measure of discrepancy. It is a short step from this to evaluating AIC or BIC (when errors are assumed to be Normally distributed). Thus, any criticism of $R^2$ indirectly attaches to these common applications. BTW, please use animations judiciously. Unless they are the only way you can communicate a critical idea, they detract so much from the text they make it almost unreadable. – whuber Oct 11 '21 at 16:16
  • Not in general. Linear regression in a strong sense maximizes R-square, although most authors prefer not to emphasize that, whereas it's not true for many other models that R-square is maximized. – Nick Cox Oct 11 '21 at 16:21
  • @whuber, Thank you for your comment! But my intention was to make you see with your eyes what I mean. I'm not familiar with AIC and BIC yet. – mathgeek Oct 11 '21 at 16:21
  • The animation is clever but doesn't really help (me at all). – Nick Cox Oct 11 '21 at 16:22
  • @whuber I am trying to interpret your comment on "measure of discrepancy" in light of [Mathworld's](https://mathworld.wolfram.com/Discrepancy.html) definition of "discrepancy". There could be relation, but these terms and usages are not the same. Could you clarify the term "measure of discrepancy" please? – DifferentialPleiometry Oct 11 '21 at 16:34
  • @whuber I think I found the clarification I was looking for [here](https://en.wikipedia.org/wiki/Discrepancy_theory). – DifferentialPleiometry Oct 11 '21 at 16:40
  • Some pseudo $R^2$ statistics are discussed here: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/ – G. Grothendieck Nov 01 '21 at 15:13
  • @G.Grothendieck "A pseudo R-squared only has meaning when compared to another pseudo R-squared of the same type, on the same data, predicting the same outcome." This tells me that the alternative $R^2$-oid measures are of limited utility when it comes to an absolute measure of performance, rather than when it comes to model comparisons. – Dave Nov 01 '21 at 15:30
  • @Dave, I wouldn't pick one item and throw everything else out because of it. Also that is true of many statistics and it isn't entirely true for all the pseudo $R^2$ statistics. For Nagelkerke $R^2$ the statistic is 1 for a perfect fit and 0 for the intercept only model. – G. Grothendieck Nov 01 '21 at 17:05

1 Answer


We can use $R^2$ for nonlinear models. In a model comparison, higher $R^2$ means lower $MSE$, even if the models are nonlinear. However, because of the lack of orthogonality, $R^2$ loses its interpretation of proportion of variance explained, so its use as an absolute measure of model performance (“Sure, model A beats model B, but is model A any good?”) is limited.

(For $MSE$, I assume that we divide by $n$ or $n-1$, not $n-p$.)
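
Here is a minimal sketch of the comparison point, with made-up predictions standing in for two hypothetical models A and B; because $y$ (and hence the total sum of squares) is fixed, ranking by $R^2$ is the same as ranking by $MSE$:

```python
# Sketch: with y fixed, R^2 and MSE are monotonically related
# (R^2 = 1 - n*MSE/SS_tot), so they rank models identically,
# whether or not the models are linear. Predictions are made up.
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

rng = np.random.default_rng(2)
y = rng.normal(size=100)
pred_a = y + rng.normal(scale=0.3, size=100)   # hypothetical "model A"
pred_b = y + rng.normal(scale=0.6, size=100)   # hypothetical "model B"

print(mse(y, pred_a) < mse(y, pred_b))   # True: A has lower MSE
print(r2(y, pred_a) > r2(y, pred_b))     # True: and therefore higher R^2
```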

Dave
  • Thank you for your reply! But why do you say that $R^2$ loses its interpretation as the proportion of variance explained? In both cases (I mean linear and non-linear models) we learn by how much your model's variance decreased w.r.t. its initial (total) variance. So, why can't you say whether model A is any good? If $R^2 = 0.86$ then your model's variance decreased by $86\%$ (no matter linear or not). – mathgeek Oct 11 '21 at 17:10
  • @mathgeek [When you decompose the total sum of squares, you wind up with the sum of squares of the regression, the sum of squares of the residuals, and an "other" term.](https://stats.stackexchange.com/questions/427390/nonlinear-regression-sse-loss) In OLS linear regression [(but not necessarily every other linear regression approach)](https://stats.stackexchange.com/questions/494274/why-does-regularization-wreck-orthogonality-of-predictions-and-residuals-in-line), that "other" term is zero, meaning that $R^2 =1- \dfrac{SSRes}{SSTotal}$ is the proportion of variability explained by the model. – Dave Nov 01 '21 at 14:45
  • Doesn't "explained" mean that the variance decreased? You can actually calculate R-squared in two ways, $R^2 = \frac{SS_{tot} - SS_{res}}{SS_{tot}}$ and $R^2 = \frac{SS_{exp}}{SS_{tot}}$. Yes, $SS_{exp} \neq SS_{tot} - SS_{res}$ for non-linear models, but why do you even need that equality? I mean, you always calculate R-squared as $R^2 = \frac{SS_{tot} - SS_{res}}{SS_{tot}}$, and in this case it behaves in the same way for both linear and non-linear models. I just can't figure out why we want $SS_{exp} = SS_{tot} - SS_{res}$ to be true if we never calculate $SS_{exp}$ explicitly. – mathgeek Nov 01 '21 at 15:05
  • Why do we want that "other" term to be zero, if the only thing we end up with is "how much variance is decreased"? And for both linear and non-linear models R-squared does its job telling us this value – mathgeek Nov 01 '21 at 15:06
  • You're always allowed to calculate $R^2 = 1 - \dfrac{SSRes}{SSTotal}$, but without that "other" term being zero, the connection to "proportion of variance explained" is lost. // I do not follow what you mean by "why do we want the other term to be zero?" Could you please elaborate? – Dave Nov 01 '21 at 15:07
  • [Here](https://i.stack.imgur.com/7nBnq.png)'s your "other" term. And it's not zero in here, right? What difference does it make? What is meant by "explained variance"? Isn't it the same as "decreased variance"? – mathgeek Nov 01 '21 at 15:24
  • I mean, the connection to "proportion of variance ***decreased***" is still here for non-linear models. But how does it differ from "proportion of variance ***explained***"? – mathgeek Nov 01 '21 at 15:26
    You don't need to play games with the English phrasing. You see the equations; it just doesn't work out as cleanly as one would hope. // The good news is that $R^2$'s popularity seems to be as an absolute measure of performance, like grades in school where $R^2=0.9$ is an $\text{A}$ that makes us happy and $R^2 = 0.5$ is an $\text{F}$ that makes us sad, but, depending on the problem, $R^2=0.5$ could be quite splendid, while $R^2=0.9$ could be quite pedestrian. – Dave Nov 01 '21 at 15:29
  • How is that possible that $R^2 = 0.9$ is pedestrian? – mathgeek Nov 01 '21 at 15:32
  • Imagine that the customer requirement is that they need the RMSE to be less than $2$ for the model to make them a profit, and your model with $R^2 = 0.9$ only gives $RMSE = 2.036$. Simulated in R: `set.seed(2021); x – Dave Nov 01 '21 at 15:35
  • But as I see it, it's not $R^2$'s fault that it gives such an $RMSE$ value. Are you saying that for the same value of $R^2$ it's possible to have different values of $RMSE$? P.S. I don't know the R language (only MATLAB or Python). – mathgeek Nov 01 '21 at 16:26
  • $R^2$ depends on both the (R)MSE and the total sum of squares. If you fix two of them, the third is determined by the algebra, so, yes, if you have two data sets with different total sums of squares but the same $R^2$ value, you will get different (R)MSE values (see the sketch after these comments). – Dave Nov 01 '21 at 16:32
  • But that's irrelevant, since what you said is true for both linear and non-linear models. But for the same $SS_{tot}$ and $R^2$ you'll always have the same $RMSE$. So, I still can't see what the problem with $R^2$ is. – mathgeek Nov 01 '21 at 16:37
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/131040/discussion-between-dave-and-mathgeek). – Dave Nov 01 '21 at 16:38
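
(The exchange above about $R^2$, RMSE, and the total sum of squares can be made concrete with a short sketch. Dave's R one-liner is cut off in his earlier comment, so the following is a separate Python illustration with assumed numbers, not his exact simulation: a fit with $R^2 \approx 0.9$ whose RMSE still exceeds a hypothetical requirement of $2$.)

```python
# Sketch (assumed numbers, not Dave's exact R code): a model with
# R^2 of roughly 0.9 whose RMSE is still above a hypothetical
# customer requirement of 2, simply because y has a large spread.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
signal = rng.normal(scale=6.1, size=n)    # part the model captures
noise = rng.normal(scale=2.03, size=n)    # part it does not
y = signal + noise
y_hat = signal                            # predictions recover the signal only

rmse = np.sqrt(np.mean((y - y_hat) ** 2))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3), round(rmse, 3))       # roughly 0.9 and just above 2
```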