
How does one simulate data (in R) to generate sample values for 1) variables with specific correlation values for a particular model, and 2) with predefined regression coefficients? 3) Can we also set the mean and SD in the same process? 4) Also, how does one simulate the p-value/significance of a variable?

This is for imitating existing models, for analysis and teaching purposes.

Sorry for not being specific: this is for multiple regression, with sample values. I would like to specify the mean and SD if possible (apparently not; it seems I can specify only one of them if I also want to specify the regression coefficients?)

Thanks for the help.

kristen
  • Do you mean simple regression or multiple regression? Do you mean specific *population* values or specific *sample* values for correlation, and the regression coefficients? Did you want to specify mean and SD of the DV or the IV or both? (If both then you won't be free to choose the regression coefficients) Please clarify your question. – Glen_b Mar 02 '15 at 06:58
  • "I would like to specify the mean and SD" ... I ask again, of what, exactly? The y? The x's? both? – Glen_b Mar 02 '15 at 07:09
  • Is this an exercise for some subject? – Glen_b Mar 02 '15 at 08:20
  • We are trying to replicate an existing regression model, mainly to show how the relationship changes when variables and moderators are added stepwise, and to explain how the significance of the coefficients changes. Since the model is based on a theory we heavily draw upon, we thought it would be quite useful to replicate the exact model and sample parameters and play around with the underlying data. – kristen Mar 02 '15 at 08:27
  • Also, does this change if I would like to specify the means and/or SDs of Y as well? – kristen Mar 02 '15 at 09:30

1 Answer


1) For the predictors (independent variables, x-variables) only, you want to:

specify the sample means, SDs, and correlation matrix

This is equivalent to specifying the covariance matrix and the means.

2) You want to specify the sample coefficients in the regression.

You can also specify the residual standard deviation, which will relate to whichever p-value you're interested in.

Step 1 is already addressed in a number of posts on site, such as

a) Generating data with given sample covariance matrix

(A covariance matrix scaled to have unit variances is a correlation matrix, so that works for both)

b) Tool for generating correlated data sets

There's some mention of R code in at least one of those.
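As a minimal sketch of step 1 (assuming the MASS package is available), `mvrnorm` with `empirical = TRUE` rescales the draws so that the *sample* means and covariance match the targets exactly:

```r
library(MASS)  # for mvrnorm

n  <- 100
mu <- c(10, 5)                 # desired sample means of x1, x2
sds <- c(2, 1)                 # desired sample SDs
R  <- matrix(c(1, 0.6,
               0.6, 1), 2, 2)  # desired sample correlation matrix

Sigma <- diag(sds) %*% R %*% diag(sds)  # covariance = D R D

# empirical = TRUE forces the sample means/covariance to match exactly;
# with empirical = FALSE these would only be the population values
X <- mvrnorm(n, mu = mu, Sigma = Sigma, empirical = TRUE)

colMeans(X)  # 10, 5 (exact)
cor(X)       # off-diagonal 0.6 (exact)
```

With `empirical = FALSE` you would instead get the population-parameter version discussed in the linked posts.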

2) Then, for the coefficients:

a) put the desired coefficients in $\beta$

b) simulate random normal errors

c) regress the errors on the x's and find the residuals

d) scale the residuals to the desired standard deviation (call the result $r$)

e) calculate $y=X\beta+r$

That's it.
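Steps (a)-(e) can be sketched in R as follows (the predictor matrix here is a placeholder; in practice generate it as in step 1, and the coefficient values and residual SD are just example choices):

```r
set.seed(1)
n    <- 100
X    <- matrix(rnorm(n * 2), n, 2)   # placeholder predictors; use step 1 in practice
beta <- c(2, 1.5, -0.8)              # (a) desired intercept and slopes
s    <- 3                            # desired residual standard error

eps <- rnorm(n)                      # (b) random normal errors
e   <- residuals(lm(eps ~ X))        # (c) regress errors on the x's, keep residuals
# (d) scale so the fitted model's residual standard error is exactly s
r   <- e * s * sqrt((n - length(beta)) / sum(e^2))
y   <- beta[1] + X %*% beta[2:3] + r # (e) y = X beta + r

fit <- lm(y ~ X)
coef(fit)           # exactly 2, 1.5, -0.8
summary(fit)$sigma  # exactly 3
```

Because the scaled residuals `r` are orthogonal to the intercept and to the columns of `X`, the fitted coefficients reproduce `beta` exactly, not just in expectation.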

There are a few extra tidbits in this post on simulating ANOVA that carry over to the regression case.

If you want to determine one of the p-values you can back out the required residual variance that would give that p-value from the other statistics.
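For the overall F-test, the model sum of squares is fixed once X and β are chosen, so the required residual variance follows directly from the F quantile at the target p-value. A sketch, reusing steps (a)-(e) above (variable names are my own):

```r
set.seed(1)
n    <- 100
k    <- 2                        # number of predictors
X    <- matrix(rnorm(n * k), n, k)
beta <- c(2, 1.5, -0.8)
p_target <- 0.01                 # desired overall F-test p-value

mu  <- beta[1] + X %*% beta[-1]  # systematic part, X beta
SSm <- sum((mu - mean(mu))^2)    # model sum of squares (fixed by X and beta)

# F = (SSm / k) / s2, so solve for the residual variance s2
F_target <- qf(p_target, k, n - k - 1, lower.tail = FALSE)
s2 <- SSm / (k * F_target)

eps <- rnorm(n)
e   <- residuals(lm(eps ~ X))
r   <- e * sqrt((n - k - 1) * s2 / sum(e^2))  # residual SS = (n-k-1) * s2
y   <- mu + r

fit <- summary(lm(y ~ X))
Fv  <- fit$fstatistic
pf(Fv[1], Fv[2], Fv[3], lower.tail = FALSE)   # 0.01
```

The individual coefficient p-values are more entangled (see the comment thread below the answer), but the overall-test version is exact by construction.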

Glen_b
  • I see your posts have a lot of useful information. Thanks. On a side note, do you have any suggestions on books/online resources for understanding and getting hands-on with R simulations? – kristen Mar 02 '15 at 08:34
  • Not really. If the simulations aren't very complex, in R simulations are often easiest via `replicate`. Is there something specific you needed to know about? – Glen_b Mar 02 '15 at 09:14
  • thanks glen.. sorry about that, I finally got my accounts merged. To repeat what I have asked before: does this change if I would like to specify the means and/or SDs of Y (outcome variable/dependent) as well? – kristen Mar 13 '15 at 14:40
  • They're implied by the other inputs. If you choose the intercept, $\beta_0$ consistent with (not equal to!) the mean you want, you can get the mean of $Y$. If you specify the SD of the error term then the conditional SD of Y follows. If you mean the unconditional SD of Y, that is a function of the conditional SD of Y and the $X$-matrix and $\beta$ vector. – Glen_b Mar 13 '15 at 14:48
  • Thanks again. And what if the same question was for population values for correlation? How does that change? We are just trying to understand. Thanks – kristen Mar 16 '15 at 12:37
  • The link for step (a) already describes how to do both population and sample versions for correlation. – Glen_b Mar 16 '15 at 13:57
  • thanks a lot glen, I am working on this. Just wanted to check if there is a post out there that helps with the p-value simulation. I just want to see how I can specify the residual standard deviation which will give the desired p-value for each X. It's probably trivial but I have no idea about this one part. – kristen Mar 18 '15 at 09:34
  • I am unable to do the c) d) e) steps you mentioned and the p-value specification. I am unable to find help on this site either. Is there some R code link that you can think of? That would be of great help! thanks – kristen Mar 18 '15 at 11:25
  • Let's look at those one at a time. part (c) is "regress the errors on the x's and find the residuals". Is it (i) performing regression or (ii) finding residuals that's causing you difficulties? – Glen_b Mar 18 '15 at 13:17
  • thanks for the reply! this is what I did: `mat` has my simulated X's; e id 0.8 is the r-square I want? – kristen Mar 18 '15 at 13:25
  • You just jumped to (d), but you said above that you couldn't do (c). I don't want to move there until it's clear that (c) is solved. Are you happy with (c)? Note that if `mat` is a data frame rather than a matrix, you could do the regression in (c) as `lm(e~.,mat)` – Glen_b Mar 18 '15 at 14:02
  • sorry, I did jump. I think I am now okay with e – kristen Mar 18 '15 at 14:07
  • In (d) you just multiply by the standard deviation of the residuals (`r=e*s`, say), though you can put that in `e` if you want. – Glen_b Mar 18 '15 at 14:14
  • okay. how does one determine what standard deviation to use here? I haven't actually decided on that. will this impact the p-values later? sorry, if this is trivial. – kristen Mar 18 '15 at 14:31
  • Certainly it impacts the p-values. – Glen_b Mar 18 '15 at 14:32
  • in that case, since e) is straightforward can we focus on this "you can back out the required residual variance that would give that p-value from the other statistics". If I want, say, x1 to show a certain p-value and x2 to show another p and so on ..how do I fix this residual S.D. (on a side note, thanks for your patience and I hope these extended comments are okay.) – kristen Mar 18 '15 at 14:49
  • In that part you quote I was assuming you were talking about the p-value for an overall F-test. That's relatively easy. The individual p-values depend in a complicated way on the variance of the x's, their correlations and on the variance of the y's. In fact, extended discussion in comments is discouraged, which is why the system was asking you to take it to chat. – Glen_b Mar 19 '15 at 00:35
  • again, thanks. Is it possible to fix the p-value of individual x's? or at least a range of values/approximate values for the x's? – kristen Mar 19 '15 at 02:10
  • You can manipulate the p-values for the $x$'s by changing the slope coefficient while holding other things constant, or by changing the s.d.s of the $x$'s while holding the corresponding slope constant ... (i.e. by changing either the numerator or the denominator in the standardized slope) but when you do that, you may affect other things you want to specify. You can also change them all up or down together by manipulating $s$. I'll think some more on this; it may be best to work with some orthogonal basis and then build a dependent set of $x$'s but I haven't worked the p-value algebra through – Glen_b Mar 19 '15 at 05:14
  • seems complicated. meanwhile, I will work with at least what I have got up till now, and I will get back here for more advice. (appreciate all the help you have extended. thanks) – kristen Mar 19 '15 at 06:25
  • another question: instead of 'exact' p-values, if I would like to show that, for example, x1 is significant at p<.05 and x2 is not, would that make it easier? Mainly to demonstrate the absence of significance. – kristen Mar 19 '15 at 09:16
  • I'd expect it to be substantially easier to get them into ranges, yes. – Glen_b Mar 19 '15 at 10:22
  • Glen, would specifying the t-value instead of the p-value make it any easier? As I am more interested in the presence or absence of a significant relationship than in a particular p-value. any ideas? thanks – kristen Mar 22 '15 at 06:55
  • Since the t-value (given the sign of the coefficient and the d.f., which we can take to be fixed) is directly obtainable from the p-value, there's no additional benefit in specifying the t-value. If you know either one you know the other. – Glen_b Mar 22 '15 at 07:36
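A quick illustration of the equivalence Glen_b describes (the degrees of freedom and t-value here are arbitrary examples):

```r
df    <- 96                    # residual degrees of freedom (arbitrary)
tval  <- 2.5                   # an example t-statistic
p     <- 2 * pt(-abs(tval), df)  # two-sided p-value from t
t_back <- qt(p / 2, df)          # recover the t-quantile from p
abs(t_back)                      # 2.5, so p and |t| carry the same information
```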
  • thanks glen. Another question, and hopefully I won't have more: when I have interaction variables, how does it change? can I just mention yhat – kristen Mar 23 '15 at 12:09
  • In that case the constant probably isn't $1$ in general. You can calculate interactions just like anything else (consider if I said $x_3=x_1\times x_2$ and put that in instead) -- but then you don't control the dependence between it and other variables, it follows from that of its main effects. – Glen_b Mar 23 '15 at 12:16