
For simple linear regression (SLR), in order for $R^2$ (the coefficient of determination) to be a meaningful measure, it must be true that $X$ and $Y$ are linearly correlated. Specifically, $R^2=r^2$, where $r$ is Pearson's correlation coefficient.
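As a quick numerical check of this identity, here is a minimal sketch with simulated data (the data and variable names are assumed purely for illustration):

```python
# Minimal sketch (simulated data): in SLR, R^2 equals the square of
# Pearson's correlation coefficient r between X and Y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)  # a genuinely linear relationship

# Fit SLR by least squares and compute R^2 directly from the residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Pearson's r between x and y.
r = np.corrcoef(x, y)[0, 1]

print(r_squared, r**2)  # the two agree up to floating-point error
```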

When we move into the multiple linear regression (MLR) framework, I'm curious how this linearity requirement transfers.

Take, for example, a polynomial model where $\hat y = \hat \beta_0 + \hat \beta_1x+ \hat \beta_2x^2$. In this model, assume the regression assumptions are met (i.e., the data are truly related according to a parabolic curve, so fitting $x^2$ as a predictor allows us to meet the linearity assumption).

Now, $X$ and $Y$ are not linearly related, but $X$ and $X^2$ jointly allow us to correctly model the association with $Y$. So, since we've met the linearity assumption of MLR, $R^2$ is meaningful, correct?

So would the conclusion be that $R^2$ is meaningful whenever the modeled relationship between $Y$ and the predictors (whatever they may be, even polynomial terms) satisfies the regression assumptions?

If so, we would say: in the case of SLR, this forces $X$ and $Y$ to be linearly related, but for MLR, the relationship between $\textbf X$ and $Y$ may be curved, as long as the linearity assumption (linearity in the parameters) is met.

Meg
  • I can't create new tags due to my reputation being < 300, so if anyone would like to create R2, linearity and/or linearity-assumption, they seem reasonable. – Meg Nov 01 '14 at 15:21
  • I added the tag for `[r-squared]`. There is a tag for `[linear-model]`, but note that this means *linear in the parameters* (i.e., the parameters of the model are coefficients; see [here](http://stats.stackexchange.com/a/111304/7290)); "linearity" in the sense of *rectilinear* isn't really an assumption of regression. – gung - Reinstate Monica Nov 01 '14 at 15:39
  • Thanks for adding r-squared. I was hoping to add a tag about linearity, indicating this problem deals with the linearity assumption (i.e., linear in the parameters) of regression. It also happens to deal with the "other" definition of linear, in that we need $X$ and $Y$ to be linear in order to meet the linearity assumption in SLR. I think adding the linear-model tag seems appropriate. – Meg Nov 01 '14 at 15:57
  • One resolution of this issue is to note that $R^2$ is the square of the correlation coefficient between the fitted and actual values. This reduces the question to whether the actual and fitted values appear to follow a linear relationship -- regardless of how nonlinear the relationship among the explanatory variables might be. Some people might find the illustrations I posted at https://stats.stackexchange.com/a/354256/919 to be helpful, too. – whuber Jun 23 '21 at 16:55
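A minimal numerical check of the identity in the last comment above (simulated data, assumed purely for illustration):

```python
# Sketch (simulated data): R^2 equals the squared correlation between
# the fitted values and the actual responses, even for a curved model.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=150)
y = 1.0 + x + 0.5 * x**2 + rng.normal(scale=0.4, size=x.size)

# Fit the quadratic model by least squares.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
corr_fit = np.corrcoef(y, y_hat)[0, 1]

print(r_squared, corr_fit**2)  # agree up to floating-point error
```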

2 Answers


Coefficient of determination in multiple linear regression: Firstly, it is not correct to say $R^2$ is "meaningless" if it is zero. In a simple linear regression this just means that the response and explanatory vectors are uncorrelated, so it is very meaningful indeed. More generally, in multiple linear regression the coefficient-of-determination can be written in terms of the correlations for the variables using the quadratic form:

$$R^2 = \boldsymbol{r}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{r}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{r}_{\mathbf{y},\mathbf{x}},$$

where $\boldsymbol{r}_{\mathbf{y},\mathbf{x}}$ is the vector of correlations between the response vector and each of the explanatory vectors, and $\boldsymbol{r}_{\mathbf{x},\mathbf{x}}$ is the matrix of correlations between the explanatory vectors (for more on this, see this related question). If the regression model form is correct then this statistic is meaningful, insofar as it gives a measure of goodness-of-fit which derives from the underlying correlations between the variables. If it is equal to zero then that gives a particular meaning, but it does not render the statistic meaningless.
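As a numerical check of this quadratic form, here is a sketch with simulated data (the coefficients and sample size are assumed only for illustration):

```python
# Sketch (simulated data): the quadratic form in the correlations,
# R^2 = r_yx' r_xx^{-1} r_yx, matches the usual OLS R^2 in MLR.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

# Usual OLS R^2 from the residual and total sums of squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared_ols = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# The same quantity from the correlation quadratic form.
C = np.corrcoef(np.column_stack([y, x1, x2]), rowvar=False)
r_yx = C[0, 1:]   # correlations of y with each explanatory vector
r_xx = C[1:, 1:]  # correlation matrix of the explanatory vectors
r_squared_corr = r_yx @ np.linalg.solve(r_xx, r_yx)

print(r_squared_ols, r_squared_corr)  # agree up to floating-point error
```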


Quadratic regression with no linear relationship: In the special case of the quadratic model you have specified, if there is no linear relationship but there is a quadratic relationship, then you have:

$$\boldsymbol{r}_{\mathbf{y},\mathbf{x}} = \begin{bmatrix} 0 \\ r_2 \end{bmatrix} \quad \quad \quad \boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \begin{bmatrix} 1 & r_{1,2} \\ r_{1,2} & 1 \end{bmatrix}.$$

Evaluating the quadratic form gives $R^2 = r_2^2 / (1 - r_{1,2}^2)$. Moreover, under this model $\operatorname{cov}(x, y) = \beta_2 \operatorname{cov}(x, x^2)$, so the absence of any linear correlation between $x$ and $y$ forces $r_{1,2} = 0$. This gives you:

$$R^2 = \boldsymbol{r}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{r}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{r}_{\mathbf{y},\mathbf{x}} = r_2^2 = \frac{\left( \sum (y_i - \bar{y}) (x_i^2 - \overline{x^2}) \right)^2}{\left( \sum (y_i - \bar{y})^2 \right) \left( \sum (x_i^2 - \overline{x^2})^2 \right)}.$$

In this case the coefficient-of-determination is equal to the square of the correlation between the response vector $\mathbf{y}$ and the quadratic explanatory vector $\mathbf{x}^2$.
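A minimal sketch of this special case, using an assumed design that is symmetric about zero (which makes the sample correlation between $x$ and $x^2$ exactly zero):

```python
# Sketch (simulated data, symmetric design): a purely quadratic
# relationship where cor(x, y) ~ 0 yet R^2 ~ cor(x^2, y)^2.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 201)  # symmetric about 0, so sample cor(x, x^2) = 0
y = 1.0 + x**2 + rng.normal(scale=0.5, size=x.size)

# Fit the quadratic model y ~ 1 + x + x^2 by least squares.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r1 = np.corrcoef(x, y)[0, 1]     # ~ 0: no linear correlation
r2 = np.corrcoef(x**2, y)[0, 1]  # the quadratic correlation r_2

print(r1, r_squared, r2**2)  # r1 ~ 0, and R^2 ~ r_2^2
```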

Ben
  • I'm not sure what the first half of your answer is regarding, as my question doesn't purport that $R^2$ is meaningless if it is 0. This is a question about when the assumptions of $R^2$ are met, so that its value - whatever it may be (0 or otherwise) - may be trusted ("is meaningful"). As with any statistic, its estimated value is meaningless if its theoretical assumptions are not sufficiently met in practice. – Meg Aug 31 '18 at 14:06
  • Perhaps I've misunderstood what you intended, but your first sentence claims that for $R^2$ to be a meaningful measure there must be correlation between $X$ and $Y$. That is what I was responding to. The issue here is that there are no assumptions required for $R^2$ to be a well-defined and meaningful measure. If the regression assumptions are met then this quantity will have a particular stochastic *behaviour*, but regardless of this, it is a measure of the squared multiple correlation in the problem. – Ben Sep 01 '18 at 02:01
  • If you can be more specific about what behavioural properties you would like $R^2$ to have for this statistic to be "trusted" then perhaps we can assist in saying what assumptions are required for that behaviour. – Ben Sep 01 '18 at 02:02
  • Thanks for clarifying. My first sentence says, "For simple linear regression (SLR), in order for $R^2$...to be a meaningful measure, it must be true that $X$ and $Y$ are **linearly** correlated." The keyword is **linearly** - if they are not linearly related, this measure is not meaningful, because a value of 0 doesn't tell us there's no correlation, just no **linear** correlation ($X$ and $Y$ could be strongly related via a parabolic relationship, e.g., which this measure would miss). This question is about how that assumption translates into higher-level models. – Meg Sep 02 '18 at 15:58
  • Very simplistically: If the linearity assumption of a regression model (whether it be simple or multiple (including polynomial)) is met, does $R^2$ always have the same meaning? – Meg Sep 02 '18 at 16:05
  • Okay, so I think I see the problem here. You are using the term "linear correlation", which is a strange term to use because ---in standard parlance--- the correlation coefficient is *always* a measure of the linear component of a relationship. The meaning of $R^2$ relates to the measure of the relationship specified in the model, so if you have a parabolic relationship, but you fit it to a model with only a linear component, then the $R^2$ is going to be a measure that relates only to the linear approximation to that parabolic relationship. – Ben Sep 02 '18 at 22:43
  • *Pearson's* is a measure of linear correlation, but Spearman's, e.g., is not. Also, my question is not about a truly parabolic relationship incorrectly represented by a model with only a linear component. It's about a model that *correctly* has a quadratic term to account for the quadratic nature between the variables. Since the assumptions of the polynomial model are met, does $R^2$ have its standard interpretation as "the percentage of variation in $y$ accounted for by the predictors in the model"? That's my question: Does it matter what the model is as long as it's the correct model? – Meg Sep 06 '18 at 15:47

The nonlinear terms are new variables.

You write your model as $\hat y = \hat \beta_0 + \hat \beta_1x+ \hat \beta_2x^2$. Instead, define $x_1 := x$ and $x_2 := x^2$. Now $\hat y = \hat \beta_0 + \hat \beta_1x_1+ \hat \beta_2x_2$, and that looks linear, doesn't it?

The math does not have to know how you got $x_2$. All the model cares about are the values of your features. One you measured; one you calculated. The model just sees numbers.
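A minimal sketch of this point (simulated data, assumed only for illustration): the least-squares fit receives three columns of numbers and never "knows" that one column is the square of another.

```python
# Sketch (simulated data): polynomial regression is just MLR on
# constructed features -- the fit only sees the design-matrix columns.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 0.5 - 1.0 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

x1 = x     # a feature you measured
x2 = x**2  # a feature you calculated -- just another column of numbers

X = np.column_stack([np.ones_like(x), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly (0.5, -1.0, 2.0): a model linear in (x1, x2)
```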

Dave