How different are fixed score and random score regression estimates of population r-square?

Question

If I understand his point correctly, in an answer to a previous question @StéphaneLaurent highlighted that the population variance explained in a linear regression (i.e., $\rho^2$) depends on whether you treat the predictors as fixed or random. From what I can tell, the literature refers to this distinction by different names, including fixed score versus random score regression (e.g., Smithson, 2001) or sometimes the "fixed-x assumption" (e.g., Aldrich, 2000).

Thus, in a fixed score regression there is a single $\rho^2$, which we might denote $\rho^2_f$. In a random score regression the $n \times p$ predictor data $X$, where $n$ is the sample size and $p$ is the number of predictors, is assumed to be drawn from a $p$-dimensional distribution. Thus, in the random score regression, there is a $\rho^2$ given $n$ and the sampled predictor values, which we can denote $\rho^2_i$. Finally, there is the variance explained were an infinite amount of data sampled both from the predictors and the outcome variable, which I'll denote $\rho^2_a$.

I assume that as sample size increases in a random score regression, the sampled predictor values will more closely match the underlying predictor distribution. As such, the variance of $\rho^2_i$ across different samples should get smaller. Presumably there is also a point where the variance of $\rho^2_i$ gets sufficiently small that, for practical purposes, the distinction between fixed score regression and random score regression becomes unimportant.

I also assume that confidence intervals around $\rho^2_a$ will be wider than those around $\rho^2_f$, because random score models have an additional source of variability. Thus, I'm curious about how researchers interested in random score regression estimate this additional source of variability. I'm also interested in what sample size is required before the distinction is no longer practically important.

Questions

- How does sample size relate to the importance of the distinction between fixed score and random score regression?
- Is there a sample size at which the variance in $\rho^2$ estimation differs minimally across fixed score and random score regression?
- Are there any methods for estimating the additional source of variance related to random score regression in estimating $\rho^2$?
- Is there any published research on these topics?

References

Aldrich, J. (2000). The origins of fixed X regression.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61(4), 605-632.

I think these are two different contexts, hence there's no sense in comparing them. Imagine you design the experiment (that is, you choose the matrix $X$); then you cannot be interested in a quantity that assumes a random $X$. By the way, I don't see how one could define the population r-squared with random $X$ for a model with non-numerical predictors (such as a one-way ANOVA model). — Stéphane Laurent, Jul 03 '13 at 11:42
@StéphaneLaurent Personally, I am not interested in experimental contexts. I'm interested in observational studies with numeric predictors. I'm interested in the differences between assuming such data are the only predictor values of interest, versus acknowledging that the predictor values are drawn from a distribution. — Jeromy Anglim, Jul 03 '13 at 12:12
I think it is not possible to construct a "purely" unconditional confidence interval without making a precise assumption about the distribution of the covariates. But any valid conditional interval is valid unconditionally too. — Stéphane Laurent, Jul 03 '13 at 16:04

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

Ok so let's try to give the definitions.

We assume the model $$y_i = \beta_1x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i$$ with the vectors of covariates $(x_{i1}, \ldots, x_{ip})$ independently generated from a multivariate normal distribution. Therefore, unconditionally, the $y_i$ are iid too, and it makes sense to consider the variance $Var(y_i)$.

Denote $\sigma^2_y=Var(y_i)$ and $\sigma^2=Var(\epsilon_i)$.

Denote for clarity $y_i^{\text{obs}}$ and $x_{ik}^{\text{obs}}$ the observed data, and set $\mu_i=\beta_1x_{i1} + \ldots + \beta_p x_{ip}$ and $\mu^{\text{obs}}_i=\beta_1x^{\text{obs}}_{i1} + \ldots + \beta_p x^{\text{obs}}_{ip}$ (though the $\mu_i$ are not observable).

By the conditional variance formula, $\boxed{\sigma^2_y=\tau^2+\sigma^2}$ where $\tau^2=Var(\mu_i)=\beta'\Sigma\beta$.
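For readers who want the intermediate step, the boxed identity can be spelled out as below. This is only a sketch, and it assumes (as the model above seems to imply) that $\epsilon_i$ has mean zero and is independent of the covariates, with $\Sigma$ denoting the covariance matrix of $(x_{i1}, \ldots, x_{ip})$:

$$\sigma^2_y = Var(y_i) = E\big[Var(y_i \mid x_{i1}, \ldots, x_{ip})\big] + Var\big(E[y_i \mid x_{i1}, \ldots, x_{ip}]\big) = \sigma^2 + Var(\mu_i) = \sigma^2 + \beta'\Sigma\beta.$$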

Then according to my answer, the population r-squared when predictors are considered as fixed is $$\rho^2_f = 1 - \frac{\sigma^2}{\frac{variation(\mu_i^{\text{obs}})}{n}+\sigma^2}.$$

Whereas the population r-squared when predictors are considered as random is $$\rho^2=1-\frac{\sigma^2}{\sigma_y^2}.$$

So basically you are interested in the approximation $$\frac{variation(\mu_i^{\text{obs}})}{n} \approx \tau^2.$$
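To get a feel for how good this approximation is at different sample sizes, one could run a small simulation along the following lines. This is only a sketch under stated assumptions: two standard normal predictors with correlation `r`, hypothetical values for $\beta$ and $\sigma^2$, and $variation(\cdot)$ taken to mean the sum of squared deviations from the mean. None of these choices come from the question itself.

```python
import math
import random


def rho2_fixed_and_random(n, beta=(1.0, 0.5), r=0.3, sigma2=2.0, seed=0):
    """Return (rho2_f, rho2) for one simulated draw of n predictor rows.

    Assumptions (illustrative only): two standard normal predictors with
    correlation r, so Sigma = [[1, r], [r, 1]]; beta and sigma2 are
    hypothetical values.
    """
    rng = random.Random(seed)
    b1, b2 = beta
    mu = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = r * z1 + math.sqrt(1 - r * r) * z2  # gives corr(x1, x2) = r
        mu.append(b1 * x1 + b2 * x2)

    # variation(mu_obs): sum of squared deviations from the mean (assumed meaning)
    mean_mu = sum(mu) / n
    variation = sum((m - mean_mu) ** 2 for m in mu)

    # tau^2 = beta' Sigma beta for the 2x2 Sigma above
    tau2 = b1 * b1 + b2 * b2 + 2 * r * b1 * b2

    rho2_f = 1 - sigma2 / (variation / n + sigma2)  # fixed-x version
    rho2 = 1 - sigma2 / (tau2 + sigma2)             # random-x version
    return rho2_f, rho2


def spread_of_rho2_f(n, reps=200):
    """Standard deviation of rho2_f across reps independent predictor samples."""
    values = [rho2_fixed_and_random(n, seed=s)[0] for s in range(reps)]
    mean = sum(values) / reps
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (reps - 1))


if __name__ == "__main__":
    for n in (20, 50, 200, 2000):
        print(n, round(spread_of_rho2_f(n), 4))
```

The printed spread is the sample-to-sample variability of $\rho^2_f$ induced by the random predictors alone, which is the quantity that should shrink as $n$ grows.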

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

This answer on adjusted r-square formulas in fixed and random-x settings reports how the standard Ezekiel formula is an estimate of fixed-x $\rho^2$ and the Olkin and Pratt formula is an estimate of random-x $\rho^2$. Thus, to the extent that these formulas provide reasonable approximations, examining them should provide some insight into the questions:

How do estimates differ? Leach and Henson (2003) present a nice table showing the effect of different formulas on a sample of published datasets in psychology (see Table 3). The mean Ezekiel $R^2_{adj}$ was .2864, compared to an Olkin and Pratt $R^2_{adj}$ of .2917 and a Pratt $R^2_{adj}$ of .2910. In line with Kromrey's initial quotation about the distinction between fixed and random-x formulas being most relevant to small sample sizes, Leach and Henson's table shows that the difference between Ezekiel's fixed-x formula and Olkin and Pratt's random-x formula is most prominent in small samples, particularly those with n less than 50.

Thus, to answer the question:

As sample size increases, the difference between fixed-x and random-x estimates from data decreases.
Some have suggested that the difference at n = 50 is already pretty small, but this ultimately depends on the precision you care about. Here are some examples to give a flavour of the difference:
- n = 1261 and 4 predictors: Ezekiel was .0269 and Olkin-Pratt was .0270 (i.e., a difference of .0001).
- n = 187 and 11 predictors: Ezekiel was .1710 and Olkin-Pratt was .1727 (i.e., a difference of .0017).
- n = 62 and 3 predictors: Ezekiel was .2007 and Olkin-Pratt was .2073 (i.e., a difference of .0066).
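To explore how the gap between the two adjustments behaves for other sample sizes, something like the following could be used. It is a sketch that assumes the commonly cited forms of the two formulas: Ezekiel as $1 - \frac{n-1}{n-p-1}(1-R^2)$, and Olkin-Pratt as $1 - \frac{n-3}{n-p-1}(1-R^2)\,{}_2F_1\!\left(1, 1; \frac{n-p+1}{2}; 1-R^2\right)$. The function names and the input $R^2 = .24$ are hypothetical and are not taken from Leach and Henson's table, so the output is not expected to reproduce the figures above exactly.

```python
def ezekiel(r2, n, p):
    """Ezekiel adjusted R^2 (assumed form of the fixed-x adjustment)."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)


def hyp2f1_11(c, z, tol=1e-12, max_terms=10000):
    """Gauss hypergeometric 2F1(1, 1; c; z) via its power series.

    The k-th term is k! / (c)_k * z^k, so successive terms differ by the
    factor (k + 1) * z / (c + k). Intended for 0 <= z < 1 and c > 1.
    """
    term = 1.0
    total = 1.0
    for k in range(max_terms):
        term *= (k + 1) * z / (c + k)
        total += term
        if abs(term) < tol:
            break
    return total


def olkin_pratt(r2, n, p):
    """Olkin-Pratt adjusted R^2 (assumed form of the random-x adjustment)."""
    f = hyp2f1_11((n - p + 1) / 2, 1 - r2)
    return 1 - (n - 3) / (n - p - 1) * (1 - r2) * f


if __name__ == "__main__":
    r2, p = 0.24, 3  # hypothetical sample R^2 and number of predictors
    for n in (20, 50, 200, 1000):
        gap = olkin_pratt(r2, n, p) - ezekiel(r2, n, p)
        print(n, round(ezekiel(r2, n, p), 4), round(olkin_pratt(r2, n, p), 4), round(gap, 4))
```

The series is written out by hand only to keep the sketch dependency-free; a library hypergeometric routine would serve equally well.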

References

Leach, L. F., & Henson, R. K. (2003). The use and impact of adjusted R2 effects in published regression research. In annual meeting of the Southwest Educational Research Association, San Antonio, TX.
