12

I am writing a report and am unsure whether it is correct to say one 'measures' r-squared or whether one 'estimates' it. I know the two words have different meanings, probably related to whether you are identifying the 'true' value or not, but as a non-statistician I am finding it hard to decide which is more appropriate.

Or perhaps neither is suitable and a different word is better.

I have searched around a fair amount and couldn't find an obvious answer.

Nick Cox
user438383

3 Answers

16

The coefficient of determination $R^2$ is the square of the multiple correlation coefficient (see this related answer), which is a function of the sample correlations between the variables. Consequently, it is something that you "measure" or "compute" from the data, rather than something you estimate. However, it is possible to define an analogous value, the square of the true multiple correlation coefficient, and treat the coefficient of determination as an estimate of it. So you could reasonably say that you "measure" the coefficient of determination, but you "estimate" the underlying value of the square of the true multiple correlation.


Here is a more formal version of this breakdown. Suppose we have a response variable $Y$ and $m$ explanatory variables $X_1,...,X_m$, and we first define the true correlation values and sample correlation values (using the Pearson coefficient) for all the variables in the problem. We will label the true correlation values $\rho_i = \mathbb{Corr}(Y,X_i)$ and $\rho_{i,j} = \mathbb{Corr}(X_i,X_j)$ and the sample correlation values $r_i = \mathbb{Corr}(\mathbf{y},\mathbf{x}_i)$ and $r_{i,j} = \mathbb{Corr}(\mathbf{x}_i,\mathbf{x}_j)$, where the latter denote sample correlations between observed vectors of values. Now define the true version and sample version of the goodness of fit vector and design correlation matrix respectively by:

$$\mathbf{GOF} \ \mathbf{vector} \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbf{DC} \ \mathbf{matrix} \quad \quad \quad \quad \\[12pt] \boldsymbol{\rho}_{\mathbf{y},\mathbf{x}} = \begin{bmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_m \end{bmatrix} \quad \quad \quad \boldsymbol{\rho}_{\mathbf{x},\mathbf{x}} = \begin{bmatrix} \rho_{1,1} & \rho_{1,2} & \cdots & \rho_{1,m} \\ \rho_{2,1} & \rho_{2,2} & \cdots & \rho_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{m,1} & \rho_{m,2} & \cdots & \rho_{m,m} \\ \end{bmatrix}, \\[40pt] \boldsymbol{r}_{\mathbf{y},\mathbf{x}} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} \quad \quad \quad \boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,m} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m,1} & r_{m,2} & \cdots & r_{m,m} \\ \end{bmatrix}.$$

The parameter version and sample version of the coefficient of determination are then given by:

$$\begin{matrix} \text{Regression model parameter (unnamed)} \quad \quad \quad \quad \quad \phi^2 = \boldsymbol{\rho}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{\rho}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{\rho}_{\mathbf{y},\mathbf{x}}, \\[6pt] \text{Coefficient of Determination} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad R^2 = \boldsymbol{r}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{r}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{r}_{\mathbf{y},\mathbf{x}}. \\[6pt] \end{matrix}$$

Now, the value $R^2$ is a statistic that can be computed from the sample, whereas the parameter $\phi^2$ is an unobservable aspect of the regression model that can only be estimated. We can of course use the coefficient of determination $R^2$ to estimate the unknown parameter $\phi^2$.
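To make the statistic/parameter distinction concrete, here is a minimal Python sketch using NumPy. Everything in the simulated setup (the coefficients `beta`, the regressor correlation matrix `Sigma_x`, the error standard deviation `sigma_eps`, the sample size, the seed and the function names) is a hypothetical choice of mine for illustration, not something from the question. It computes $R^2$ from the sample correlations using the formula above, checks that it agrees with the usual $1 - SS_{res}/SS_{tot}$ from a least-squares fit with an intercept, and computes $\phi^2$ from the population correlations, which is only possible here because the data-generating process is known.

```python
import numpy as np


def r_squared_from_correlations(y, X):
    """Sample R^2 = r_yx^T r_xx^{-1} r_yx from Pearson correlations."""
    # Correlation matrix of [y, x_1, ..., x_m]; columns are the variables.
    C = np.corrcoef(np.column_stack([y, X]), rowvar=False)
    r_yx = C[0, 1:]   # goodness of fit vector (y with each regressor)
    r_xx = C[1:, 1:]  # design correlation matrix (regressors with each other)
    return float(r_yx @ np.linalg.solve(r_xx, r_yx))


def r_squared_from_ols(y, X):
    """Sample R^2 = 1 - SS_res / SS_tot from least squares with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))


# Hypothetical data-generating process (all values are assumptions).
rng = np.random.default_rng(0)
n = 500
beta = np.array([1.0, -0.5])
Sigma_x = np.array([[1.0, 0.3],
                    [0.3, 1.0]])  # unit variances, so this is also rho_xx
sigma_eps = 1.0

X = rng.multivariate_normal(np.zeros(2), Sigma_x, size=n)
y = X @ beta + rng.normal(0.0, sigma_eps, size=n)

# Statistic: computed from the observed sample.
r2_corr = r_squared_from_correlations(y, X)
r2_ols = r_squared_from_ols(y, X)

# Parameter: computed from the (normally unknown) population correlations.
var_y = beta @ Sigma_x @ beta + sigma_eps ** 2
rho_yx = (Sigma_x @ beta) / np.sqrt(var_y)  # Corr(Y, X_i), since sd(X_i) = 1
phi2 = float(rho_yx @ np.linalg.solve(Sigma_x, rho_yx))

print("R^2 from sample correlations:", r2_corr)
print("R^2 from least-squares fit:  ", r2_ols)
print("phi^2 (population parameter):", phi2)
```

The first two numbers agree up to floating-point error because they are the same statistic computed in two ways. The third is a fixed property of the assumed model, and the sample $R^2$ will differ from it by sampling error that changes with every new sample.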

Ben
  • This sounds like a distinction without a difference to me. Statistics is full of messy and inconsistent terminology. Are you describing it, defending it or both? – Nick Cox Sep 08 '21 at 17:54
  • 13
    @Nick I see a very real difference: the "computation" is a statistic whereas the target of the "estimate" is a parameter of the model. All three--the statistic, the statistic *qua* estimator, and the parameter--can be called a "coefficient of determination." This is perfectly in keeping with the ambiguous use of other terms like "mean" and "standard deviation" in similar situations. – whuber Sep 08 '21 at 18:20
  • 2
    Just to elaborate slightly on @whuber's trichotomy: I'm more familiar with them being called estimates, estimators, and estimands, respectively. – user551504 Sep 08 '21 at 19:05
  • 1
    I have added a new section to this answer to be more explicit about the parameter and statistic in the present case. – Ben Sep 08 '21 at 21:48
  • +1, though missed chance of defining the estimand ${\it Ρ}^2$ (capital $\rho$) instead of $\phi^2$ :) – Firebug Sep 08 '21 at 21:57
  • 2
    @Firebug: I think the problem here is that capital-rho (in Greek) uses an identical symbol to capital-p (in English). It is usual to use unambiguously Greek symbols for model parameters, so I have chosen to switch to $\phi^2$ for this purpose. Capital-rho is also rather fraught with peril in a statistical context, since capital-p is often used to denote a probability value or probability measure. – Ben Sep 09 '21 at 01:24
  • 1
    That's just me being pedantic :P. It's the same glyph, yes – Firebug Sep 09 '21 at 13:53
7

Although some theory relates $R^2$ to a certain "true" value, it is mostly not used or interpreted as an estimator of that value, but rather as a characteristic of an empirical fit. As such, I'd say it is "computed" and not "estimated". I wouldn't say "measured" either (it goes against my intuition, but I have difficulty explaining why), although one could probably defend both "measured" and "estimated" if necessary.

Christian Hennig
  • 1
    This seems to hinge on how words are used in practice, which is fine by me, but your opening also weakens your case. – Nick Cox Sep 08 '21 at 17:52
  • I agree about "measured". Like "estimated" it suggests that there is something apart from the data that you are attempting to get at. – John Coleman Sep 09 '21 at 12:11
2

To me, "estimates" sounds weird because the r-squared is not a representation of some hidden object. "Measures" could be okay, but I don't hear that very often. I would prefer "calculates" or "computes".

Kota Mori
  • 4
    Sounds weird is not much of a reason. – Nick Cox Sep 08 '21 at 17:51
  • 3
    Re "not a representation:" On the contrary, $R^2$ is a definite property of the underlying population or model. – whuber Sep 08 '21 at 18:22
  • @whuber: That assumes the distribution of predictors $X$ is part of the model. But a regression model is often taken as only specifying the conditional distribution of $Y\mid X$, and in that case $R^2$ is not a property of the model. So I would prefer to say "calculates" – kjetil b halvorsen Sep 10 '21 at 15:23
  • @kjetil No such assumption is necessary. Either the distribution of the regressors is part of the model or the regressors are *fixed.* It doesn't matter: $R^2$ remains a property of the data generating function. – whuber Sep 10 '21 at 17:51