
Reading about standardized coefficients I came across the following formula: $$R^2=\sum_i\beta_ir_{yi}$$ where $\beta_i$ is the standardized coefficient for independent variable $i$ and $r_{yi}$ is the correlation coefficient between $y$ and variable $i$.

Even though this seems intuitive, I have not been able to find a formal proof or derive the formula myself.

MarianoC
  • What is "$r_{yi}$"? – whuber Nov 26 '19 at 00:10
  • The correlation coefficient. – MarianoC Nov 26 '19 at 01:00
  • Can you tell us where you found that formula? – kjetil b halvorsen Nov 26 '19 at 04:50
  • The correlation coefficient of what variables in what sense? In this context, the distinction between the usual bivariate Pearson correlation (which is irrelevant) and the *partial correlation* is huge. – whuber Nov 26 '19 at 13:48
  • @whuber: It is the usual Pearson correlation between response and covariable. It is true: the R-squared is equal to the scalar product of coefficients and "usual" correlations, given all involved variables have unit variance. – Michael M Mar 10 '21 at 20:39
  • @Michael That's a nice way to put it and points out this is a nice relationship to remember, thank you. – whuber Mar 10 '21 at 20:52
  • I have come across an old thread at https://stats.stackexchange.com/questions/124887/ which treats the case of two variables in detail. – whuber Mar 16 '21 at 16:49

1 Answer


I offer two solutions. The first (geometric) solution reverses the usual least-squares perspective by starting with its solution--the fitted values--and working backwards to the problem(!), which exposes the basic nature and simplicity of this result. The second (algebraic) solution cranks the standard least-squares machinery to show how the result can be obtained in a straightforward manner using familiar formulas that simplify when the variables are first standardized.

For the cognoscenti, I will summarize the first solution to spare you the effort of reading through it. The ordinary least squares solution orthogonally projects the response vector $y$ onto the subspace generated by the explanatory variables $x_i$. This means the projection can be expressed as a linear combination of the $x_i$. The $\beta_i$ are its coefficients (I won't use hats here, since we will never refer to a "true model"). When the response and the explanatory variables are initially standardized, $R^2$ is just the squared length of the projection, $|\hat y|^2.$ That squared length, in turn, is the inner product of the response vector with its projection (because the residuals are orthogonal to the projection). Computing that inner product term by term introduces the inner products of $y$ with the $x_i$--but because these vectors have all been standardized, those inner products are just the correlation coefficients $r_{yi}.$ The equation $R^2 = \beta_1 r_{y1} + \beta_2 r_{y2} + \cdots$ drops right out.
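As a quick numerical sanity check of this summary, here is a minimal numpy sketch; the simulated data, the seed, and the use of `np.linalg.lstsq` are illustrative assumptions, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -1.0, 0.3]) + rng.normal(size=n)

# Standardize: center each variable and scale it to unit length.
std = lambda v: (v - v.mean(axis=0)) / np.linalg.norm(v - v.mean(axis=0), axis=0)
Xs, ys = std(X), std(y)

beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)   # standardized coefficients
r = Xs.T @ ys                                    # correlations r_{yi} (inner products of unit vectors)

y_hat = Xs @ beta
print(y_hat @ y_hat / (ys @ ys))                 # R^2 as a squared length ratio
print(beta @ r)                                  # sum_i beta_i r_{yi}: same value
```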


Geometric solution

To serve as a counterpoint to the heavy linear algebra in the second solution, this discussion will not be any more advanced (mathematically) than the basics of analytic geometry taught in high schools for generations. I will, however, freely use the (modern) terminology of "vector spaces," "linear combination," "inner products," "orthogonality," and "linear forms" that is often avoided at the most elementary level, to make the connections between the two solutions more apparent.

Forget about regression for a moment and just suppose you are presented with a vector $\hat y.$ (As the notation suggests, $\hat y$ eventually will play the role of a least squares prediction--but so far we don't have anything to predict!) Choose, in any way you please, any number of (even infinitely many) vectors $x_1,x_2,\ldots$ with which $\hat y$ may be represented as a linear combination. That is, suppose there are scalars $\beta_i$ for which

$$\hat y = \beta_1 x_1 + \beta_2 x_2 + \cdots.$$

Now let $\phi$ be any linear form, also known as a covector. By definition, this means only that $\phi$ is a linear function defined on the vector space, with scalar values, for which

$$\phi(\hat y) = \phi(\beta_1 x_1 + \beta_2 x_2 + \cdots) = \beta_1\phi(x_1) + \beta_2\phi(x_2) + \cdots.\tag{1}$$

Believe it or not, that is our result! It remains only to apply it in the special case of ordinary least squares regression with standardized variables.

Because this is a least squares setting, our vector space is endowed with a Euclidean norm $|\ |$ giving the lengths of vectors (as a root sum of squares--that's where least squares comes into the picture) and its associated inner product $\langle\ ,\ \rangle$ for which $|x|^2 = \langle x,x\rangle$ for any vector $x.$ This inner product provides a splendid way to obtain linear forms. Namely, given any vector $y,$ define the function $y^{*}$ via

$$y^{*}(x) = \langle y, x\rangle.$$

Because the inner product is bilinear, $y^{*}$ is automatically linear, whence it is a linear form.
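Here is a small sketch of relation $(1)$ with $\phi = y^{*} = \langle y,\ \cdot\ \rangle$; the vectors and coefficients below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 10))          # arbitrary vectors
beta = np.array([2.0, -0.5, 1.25])             # arbitrary coefficients
y_hat = beta[0] * x1 + beta[1] * x2 + beta[2] * x3

y = rng.normal(size=10)
phi = lambda v: y @ v                          # the covector y* = <y, .>

lhs = phi(y_hat)
rhs = beta[0] * phi(x1) + beta[1] * phi(x2) + beta[2] * phi(x3)
print(np.isclose(lhs, rhs))                    # True: phi distributes over the linear combination
```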

The term "standardized coefficient" in the question is conventional, but it's misleading: it's not the coefficient that has been standardized; it means the coefficient is obtained by first standardizing all the variables involved. So, let us restrict the foregoing discussion to unit vectors $x_i,$ which means $|x_i|=1,$ and let $y$ be an arbitrary unit vector (not, apparently, having anything whatsoever to do with $\hat y$ and the $x_i$).

In this case, where $\phi = y^{*},$ the basic relation $(1)$ is

$$y^{*}(\hat y) = \langle y, \hat y\rangle = \beta_1 \langle y, x_1\rangle + \beta_2 \langle y, x_2\rangle + \cdots.\tag{2}$$

The penultimate step is to suppose the scalars are real numbers and that the components of all vectors sum to zero. In this case, the inner products in the preceding sum are correlation coefficients:

$$\langle y, x_i\rangle = r_{yi}$$

(using the notation of the question). This is because the correlation coefficient of two vectors is defined as the sum of products after the vectors have been recentered (to make their components sum to zero) and normalized to unit length. For more about correlation from this perspective see Freedman, Pisani, & Purves, Statistics (any edition), a classic introductory (almost formula-free) textbook.
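A short sketch of this fact: once two vectors are centered and rescaled to unit length, their plain inner product agrees with the usual Pearson correlation. The data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=50)
b = 0.7 * a + rng.normal(size=50)

# center and scale to unit length
to_unit = lambda v: (v - v.mean()) / np.linalg.norm(v - v.mean())

print(to_unit(a) @ to_unit(b))   # inner product of the standardized vectors
print(np.corrcoef(a, b)[0, 1])   # the usual Pearson correlation: same value
```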

At some point we need to introduce $R^2.$ For this purpose I propose a general definition that reduces to the usual one in the least squares setting.

Definition: Given a nonzero vector $y$ and any vector $\hat y,$ let $$R^2(\hat y, y) = \frac{|\hat y|^2}{|y|^2} = \left(\frac{|\hat y|}{|y|}\right)^2.$$ It is the square of the ratio of the lengths of these vectors.

In any regression, no matter how it may be performed, when $\hat y$ is the regression estimate of $y$ this formula exhibits $R^2$ as the "regression sum of squares" ($|\hat y|^2$) divided by the "total sum of squares" ($|y|^2$). Usually $R^2$ is computed after centering $y$ (when the model contains an intercept), but it is often computed and reported even when $y$ is not centered ("regression through the origin"). For a good discussion of this, see Removal of ... intercept term increases $R^2$.

In this generality all we can say is that $R^2$ is not negative--but it could be arbitrarily large. That is about to change. But, in passing, observe that when $|y|=1,$ the formula simplifies to $$R^2(\hat y, y) = |\hat y|^2 / |y|^2 = |\hat y|^2.$$
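A minimal sketch of this definition, taking $\hat y$ to be the projection of a unit vector $y$ onto a single unit vector $x$ (an illustrative choice): with $|y|=1$ the ratio and the squared length coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
unit = lambda v: (v - v.mean()) / np.linalg.norm(v - v.mean())

y = unit(rng.normal(size=30))      # centered, unit-length response
x = unit(rng.normal(size=30))      # one centered, unit-length regressor
y_hat = (x @ y) * x                # orthogonal projection of y onto x

R2 = (y_hat @ y_hat) / (y @ y)     # squared ratio of lengths
print(R2, y_hat @ y_hat)           # identical, because |y| = 1
```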

Finally suppose that $y-\hat y$ is orthogonal to $\hat y.$ This is geometric language for stating

$$0 = \langle y - \hat y, \hat y\rangle = y^{*}(\hat y) - |\hat y|^2.$$

This connects the value of the form $y^{*}$ at $\hat y$ to the (squared) length of $\hat y:$ the two must be equal.

Applying this observation to $(2)$ and using the notation $r_{yi}$ gives

$$R^2(\hat y, y) = |\hat y|^2 = y^{*}(\hat y) = \beta_1 r_{y1} + \beta_2 r_{y2} + \cdots.\tag{3}$$
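Here is a sketch of relation $(3)$ on simulated, standardized data: the residual is orthogonal to $\hat y,$ so $\langle y,\hat y\rangle,$ $|\hat y|^2,$ and $\sum_i\beta_i r_{yi}$ all coincide. The names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)

# center each variable and scale it to unit length
std = lambda v: (v - v.mean(axis=0)) / np.linalg.norm(v - v.mean(axis=0), axis=0)
Xs, ys = std(X), std(y)

beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)   # least-squares coefficients
y_hat = Xs @ beta

print(np.isclose((ys - y_hat) @ y_hat, 0))       # residual is orthogonal to y_hat
# <y, y_hat>, |y_hat|^2, and beta . r are one and the same number
print(ys @ y_hat, y_hat @ y_hat, beta @ (Xs.T @ ys))
```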

So far everything has been about simple (almost trivial) relations among vectors in an inner product space. But geometrically, that's all least squares is: given a response vector $y$ and a collection of explanatory vectors $x_1,x_2,\ldots$ (in the same vector space as $y,$ of course), the Normal Equations of least squares theory assert that a least squares solution $\hat y$ is any linear combination of the $x_i$ whose residual is orthogonal to every $x_i$--and therefore to $\hat y$ itself:

$$\langle y - \hat y, \beta_1 x_1 + \beta_2 x_2 + \cdots\rangle = 0.$$

That was our final supposition above, which implied relation $(3),$ and we are done.


Algebraic solution

The question concerns regression statistics developed from a model matrix $X$ and response variable $y$ that have all been normalized: that is, the sums of all columns are zero, the sums of their squares all equal a common constant $C\ne 0,$ and any constant column has been removed from $X.$ ($C$ varies depending on whether one is using Maximum Likelihood estimates, Ordinary Least Squares estimates, or whatever, but it will turn out that its actual value doesn't matter.)

Because of these normalizations, some of the (usual) formulas simplify, including

$$(r_{y1}, r_{y2}, \ldots, r_{yp})^\prime = r(X,y) = \frac{1}{C} X^\prime y$$

is the vector of correlation coefficients between $y$ and the columns of $X,$ and

$$y^\prime y = C$$

is the total sum of squares, $TSS.$
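A sketch of these two simplifications, assuming the variables are z-scored with denominator $n-1$ so that $C=n-1$ (one convenient, illustrative choice of $C$).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# z-score each column: mean 0, sum of squares C = n - 1 (assumed convention)
z = lambda v: (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)
Xs, ys = z(X), z(y)
C = n - 1

# r(X, y) = X'y / C matches the usual Pearson correlations
r_Xy = Xs.T @ ys / C
print(np.allclose(r_Xy, [np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))

# y'y = C is the total sum of squares
print(np.isclose(ys @ ys, C))
```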

Two useful formulas (which don't simplify) are

$$\hat\beta = (X^\prime X)^{-}X^\prime y$$

for the (standardized) regression coefficients (which estimate the true coefficients $\beta$) and

$$SSR = \hat y^\prime \hat y = (X\hat\beta)^\prime (X\hat\beta) = y^\prime X(X^\prime X)^{-}X^\prime y$$

for the "regression sum of squares."

Since $R^2$ is defined as the ratio of the regression sum of squares to the total sum of squares,

$$R^2 = \frac{SSR}{TSS} = \frac{y^\prime X(X^\prime X)^{-}X^\prime y}{C} = y^\prime X(X^\prime X)^{-}\left[\frac{1}{C}\, X^\prime y\right] = \hat\beta^\prime r(X,y).$$

In non-matrix form this latter expression is the sum (over $i$) of $\hat\beta_{i}r_{yi},$ QED.
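Finally, a sketch tying the whole chain together numerically: $R^2 = SSR/TSS$ agrees with $\hat\beta^\prime r(X,y).$ Again the data and the z-scoring convention are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 120, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

z = lambda v: (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)   # z-score each variable
Xs, ys = z(X), z(y)
C = n - 1

beta_hat = np.linalg.pinv(Xs.T @ Xs) @ Xs.T @ ys   # standardized coefficients
r_Xy = Xs.T @ ys / C                               # correlation vector r(X, y)

R2 = (Xs @ beta_hat) @ (Xs @ beta_hat) / (ys @ ys) # SSR / TSS
print(np.isclose(R2, beta_hat @ r_Xy))             # True: R^2 = sum_i beta_i r_{yi}
```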

whuber
  • How do you define SST or TSS: as 1) SSR + SSE, or 2) $\sum (Y_i - \bar Y)^2$? What is the meaning of $X'$: is it some matrix derived from $X$? And is the $-$ in $(X^\prime X)^{-}$ short notation for the inverse matrix? – Easy Points Mar 11 '21 at 00:10
  • @Easy All these notations are standard, except perhaps the superscript "$-$": that refers to a [generalized inverse.](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse) It acknowledges that $X^\prime X$ might not be invertible, but nevertheless there is a rigorous way to compute the analog of an inverse that makes the formula work out. – whuber Mar 11 '21 at 13:19
  • Since you are answering just a single question: I still don't understand what $X^\prime$ is here, but I guess it is the same as the transposed matrix $X^T$? Please confirm. – Easy Points Mar 11 '21 at 14:01
  • Also, how have the feature matrix $X$ and the response $y$ been normalized? Those could be different types of normalization, right? You can normalize the data to have mean 0, for instance, or go further and standardize it to z-scores. – Easy Points Mar 11 '21 at 14:05
  • Ha, great derivation, thanks! – Michael M Mar 11 '21 at 18:36
  • @MichaelM I strongly disagree with you. The notation used here is not part of this [ultimate reference](http://www2.imm.dtu.dk/pubdb/edoc/imm3274.pdf), and I haven't found a clue to the meaning of $X^\prime$, for instance. The terms used are not clearly explained, so from the perspective of students it is not so readable. – Easy Points Mar 11 '21 at 19:10
  • @Easy Maybe it's time to look at a multiple regression textbook. Over the last 40 years a standard terminology has developed for discussing multiple regression, as illustrated not only in this thread but in many thousands of threads here on CV. It would be impossible to answer most questions without resorting to this notation and terminology: otherwise the notation grows ornate and the formulas explode in complexity. It would also be conceptually poorer, because the matrix formulation, once understood, makes it (almost) as easy to understand the multiple-variable case as the ordinary one-variable case. – whuber Mar 11 '21 at 20:25
  • @whuber I don't make any compromises when I try to understand something, and here I was not able to reproduce the conclusion on my end. I am very interested in $R^2$ and matrix factorization, and a bit in regression. Your answers could receive many more points if they were approachable to me; you are, after all, writing answers for people like me. I recognize the tendency when all the good info is hidden. Here is an example of [not hidden](https://datascienceplus.com/understanding-the-covariance-matrix/). I planned to use your post and rewrite it with some more tips on covariance, eigenvalues, etc. – Easy Points Mar 11 '21 at 20:53
  • So I checked the book by John I. Marden (Multivariate Statistics: Old School), since it is available as a PDF, and I can confirm the book does not define the symbol ${}^\prime$ although it uses it extensively. My question still remains: what is $X^\prime$, and is there any question on this website that explains what I need? PS: I don't want to purchase books just to find out what the notation means. Thanks. – Easy Points Mar 12 '21 at 00:29
  • @Easy The fact that standard textbooks don't define this notation is a clear indication that it is universally understood, much like "$\times$" and "$+$" are. It sounds like a quick look at a linear algebra textbook might be helpful to you. – whuber Mar 12 '21 at 13:09
  • @Easy I added a solution that might be more to your liking. Thank you for prompting the reflections that went into its development. – whuber Mar 12 '21 at 14:26
  • While reading your updated post: why is this guy [saying](https://youtu.be/blyXCk4sgEg?t=870) that $R^2$ should be less than 1, equals 0 if we predict the mean, and can even be negative? – Easy Points Mar 12 '21 at 15:33
  • OK, I saw the [link](https://stats.stackexchange.com/questions/26176/), which gives some clues about $R^2$; my question was premature, but let the comment remain because it is a good link. – Easy Points Mar 12 '21 at 16:20
  • @whuber I read your improved post; it is detailed and made for the *cognoscenti*, so I still cannot understand it. But since you put that much effort into it, I will +1 you. I usually like the [books](https://www.deeplearningbook.org/contents/notation.html) that include a notation section. And regarding $X^\prime$: I still don't understand it, but I will eventually in the near future. Good luck. – Easy Points Mar 12 '21 at 16:27