I am concerned with simulating data for a linear regression model. I need to control the means, variances, and correlations (covariances) between the predictors and the criterion variable. In addition, I need to be able to vary the explained variance ($R^2$). It is obvious to me that the latter must be a function of the former, so at least one correlation (covariance) in $\Sigma$ presumably depends on the choice of $R^2$, where $\Sigma=E\left((Y,X)(Y,X)^T\right)$, for centered $X$ and $Y$, is the variance-covariance matrix of all variables.
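To make this dependence explicit (if I am not mistaken), partition the covariance matrix as
$$\Sigma = \begin{pmatrix} \sigma_{YY} & \sigma_{YX} \\ \sigma_{XY} & \Sigma_{XX} \end{pmatrix},$$
where $\Sigma_{XX}$ is the covariance matrix of the predictors and $\sigma_{XY}$ is the vector of covariances between the predictors and $Y$; then the population $R^2$ implied by $\Sigma$ should be
$$R^2 = \frac{\sigma_{YX}\,\Sigma_{XX}^{-1}\,\sigma_{XY}}{\sigma_{YY}},$$
so fixing $R^2$ indeed constrains the admissible covariances.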
My plan is thus as follows:
- Specify $\Sigma$, means, and $R^2$
- Simulate data with these sufficient statistics, e.g. by sampling from the multivariate normal
- Check the estimated $\beta$ (regression coefficient) vector against the population (theoretical) coefficients, and then use the model for unrelated tests/science (a rough sketch of this plan follows below).
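A minimal sketch of this plan in Python/NumPy (all numbers are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1234)

# Population sufficient statistics (purely illustrative values),
# variables ordered as (Y = income, X1 = years of education, X2 = some other predictor)
mu = np.array([30000.0, 12.0, 5.0])
Sigma = np.array([[5000.0**2, 4000.0, 800.0],
                  [4000.0,    3.0**2, 1.5  ],
                  [800.0,     1.5,    2.0**2]])

# 1) simulate data with these population moments
n = 10_000
data = rng.multivariate_normal(mu, Sigma, size=n)
Y, X = data[:, 0], data[:, 1:]

# 2) OLS fit (with intercept)
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)

# 3) compare beta_hat against the population coefficients implied by Sigma
#    (how to compute those directly is exactly my first question below)
print(beta_hat)
```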
Hence, my approach is not to specify $\beta$ directly, but to let the coefficients be a function of the population $\Sigma$, the means, and $R^2$. The reason I need to do this is to attribute some realistic scale to $X$ and $Y$ (e.g., let $Y$ assume an 'income' scale and give $X$ a realistic scale for years of education). Therefore, I specify the sufficient statistics instead of regression coefficients. But maybe there is a better way.
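To illustrate what I mean by a realistic scale, I would typically write down a correlation matrix together with standard deviations and means on the original units and convert those into $\Sigma$; a rough sketch with made-up numbers:

```python
import numpy as np

# made-up correlations for (Y = income, X1 = years of education, X2 = some other predictor)
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.0]])
sd = np.array([5000.0, 3.0, 2.0])    # standard deviations on realistic scales
mu = np.array([30000.0, 12.0, 5.0])  # means on realistic scales

# covariance matrix implied by the correlations and standard deviations
Sigma = np.outer(sd, sd) * R
```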
Moreover, I have two specific questions:
First, given the population variance-covariance matrix $\Sigma$ of one criterion variable $Y$ and a series of predictors (covariates) $X$, I would like to calculate the vector of true population regression coefficients. Of course, I could simulate data $X$ and $Y$ and use the OLS estimator, but is there a direct way to obtain the population $\beta$ from $\Sigma$?
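For concreteness, here is what I suspect the direct calculation looks like (ordering the variables as $(Y, X_1, \ldots, X_p)$ in $\Sigma$); please correct me if this guess is wrong:

```python
import numpy as np

# purely illustrative population moments, ordered (Y, X1, X2)
mu = np.array([30000.0, 12.0, 5.0])
Sigma = np.array([[5000.0**2, 4000.0, 800.0],
                  [4000.0,    3.0**2, 1.5  ],
                  [800.0,     1.5,    2.0**2]])

Sigma_XX = Sigma[1:, 1:]   # Cov(X)
sigma_XY = Sigma[1:, 0]    # Cov(X, Y)

# my guess at the simulation-free population slopes and intercept
beta_pop = np.linalg.solve(Sigma_XX, sigma_XY)
alpha_pop = mu[0] - mu[1:] @ beta_pop
print(alpha_pop, beta_pop)
```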
Second, what options are there to specify the covariances (correlations) in $\Sigma$, given that I need a fixed $R^2$ for a linear regression of $Y$ on $X$? This is so I can systematically vary the explanatory power of the regression model.
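One option I have considered (I am not sure whether it is sound): keep $\Sigma_{XX}$ and $\sigma_{YY}$ fixed, and rescale the covariance vector $\sigma_{XY}$ so that the implied $R^2$ matches the target; a sketch with assumed values:

```python
import numpy as np

def rescale_for_target_r2(Sigma, r2_target):
    """Rescale Cov(X, Y) (Y is the first variable in Sigma) so that the implied
    population R^2 equals r2_target, keeping Cov(X) and Var(Y) fixed."""
    Sigma = np.array(Sigma, dtype=float)
    sigma_YY = Sigma[0, 0]
    Sigma_XX = Sigma[1:, 1:]
    sigma_XY = Sigma[1:, 0]
    r2_current = sigma_XY @ np.linalg.solve(Sigma_XX, sigma_XY) / sigma_YY
    c = np.sqrt(r2_target / r2_current)          # implied R^2 scales with c^2
    Sigma[1:, 0] = Sigma[0, 1:] = c * sigma_XY
    return Sigma

# illustrative values, ordered (Y, X1, X2)
Sigma = np.array([[5000.0**2, 4000.0, 800.0],
                  [4000.0,    3.0**2, 1.5  ],
                  [800.0,     1.5,    2.0**2]])
Sigma_new = rescale_for_target_r2(Sigma, r2_target=0.4)
```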