Simulating data based on correlation coefficient?

Question

If a numerical vector $\vec{x}$ is a sample drawn from a normal distribution, is there a way, given a correlation coefficient $\rho$, to simulate a second vector $\vec{y}$ such that $\operatorname{corr}(\vec{x},\vec{y})=\rho$?

My hunch is that $\vec{x}$ would be multiplied by an equal-length sample from a uniform distribution centered at 1 and bounded by some interval determined by the value of $\rho$, but I'm just guessing.

X would be a sample drawn from a normal distribution; sorry, I should have specified. I've updated the question to reflect that. — SubstantiaN, Sep 16 '18 at 01:52
Your notation isn't clear to me -- do you need to specify the *sample* correlation or the *population* correlation? (In either case, the question has already been answered on site several times, but I need to know what to close it as a duplicate of, unless I locate one that answers both.) — Glen_b, Sep 16 '18 at 05:34
@Glen_b Yes, you are correct, I was asking about sample correlation. Your answer on the other post addresses my question. What is the preferred practice here, should I delete this question? — SubstantiaN, Sep 16 '18 at 15:05
No, left undeleted it serves a useful function -- people searching for an answer that use search terms that turn up your question but not that one will find an answer. — Glen_b, Sep 16 '18 at 15:08
@Glen_b Is there a name for the method you applied in answering the other post and/or a reference you would recommend for citation? I would post this comment to your original answer, but I don't have sufficient rep points to post there. — SubstantiaN, Sep 16 '18 at 15:23
Do you mean the method used in part (1) there (the method for population correlation), or the slight tweak in (2) to get zero sample correlation, or the other slight tweak in (2) to get the sample standard deviation to be 1? ... I can't say I have a reference (though certainly many will exist); each of the steps is obvious enough once you know a little basic statistical theory. I think I may have seen the first method in an exercise once, but I couldn't say for sure. The second I don't recall seeing before I first wrote it down, but I've seen many people do it since, so it was already well known, — Glen_b, Sep 16 '18 at 21:41
... consequently my guess is I probably at least heard of it rather than rediscovering it independently, but if so I couldn't say where. The algebra behind the general version of the method in (1) relies on just a few well-known properties: linear combinations of multivariate normals are normal, expectation is linear, and the variance of $AX$ is $A\, \text{Var}(X)\, A'$. Then you just need a way of finding an $A$ such that $\Sigma=AA'$, which is where the Cholesky decomposition comes in, being one simple way to achieve that (I never remember how it works; I just rederive it when I need it). — Glen_b, Sep 16 '18 at 21:52
I expect you'd find at least the method in (1) of that post in any decent book on simulation methods. The method in (2) is simply a tweak to make the sample have the desired properties rather than the population — Glen_b, Sep 16 '18 at 21:57
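
The Cholesky route described in the comments above can be sketched in R as follows. This is a minimal illustration for the population-correlation case, not necessarily the exact code of the linked answer; the values of `rho` and `n` and the variable names are assumptions chosen for the example.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
rho   <- 0.8                      # target population correlation (illustrative)
n     <- 1000                     # sample size (illustrative)
Sigma <- matrix(c(1,   rho,
                  rho, 1), nrow = 2)   # target covariance matrix of (X, Y)

# chol() returns an upper-triangular R with t(R) %*% R == Sigma,
# so A = t(R) is lower triangular and satisfies A %*% t(A) == Sigma.
A <- t(chol(Sigma))

Z  <- matrix(rnorm(2 * n), nrow = 2)   # 2 x n independent standard normals
XY <- A %*% Z                          # Var(AZ) = A I A' = Sigma for each column
x  <- XY[1, ]
y  <- XY[2, ]

cor(x, y)   # near rho, but not exactly rho
```

For this 2 x 2 case the first row of `A` is `(1, 0)` and the second is `(rho, sqrt(1 - rho^2))`, so `x` is just the first standard normal draw and `y` equals `rho * x` plus independent noise with standard deviation `sqrt(1 - rho^2)`.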

score 1 · Answer 1 · answered Sep 16 '18 at 05:10

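For a standard bivariate normal distribution (means 0, standard deviations 1, correlation $\rho$), the conditional distribution of $Y$ given $X = x$ is $Y|X=x \sim \mathsf{Norm}(\rho x, \sqrt{1-\rho^2}),$ where the second parameter is the standard deviation.

Implementing this in R with $\rho = 0.8,$ we have the following code, where the number of values of y generated must equal the length of x.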

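set.seed(2018)
x = rnorm(200, 50, 1)                 # 200 draws of x: mean 50, SD 1
y = rnorm(200, .8*x, sqrt(1-.8^2))    # one y per x: mean 0.8*x, SD sqrt(1 - 0.8^2)
plot(x,y, pch=20)                     # scatterplot of the simulated pairs
cor(x,y)                              # sample correlation
[1] 0.8016411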
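
The call above works because x has standard deviation 1; the mean of 50 shifts the data without affecting the correlation. When x has a different standard deviation, one option is to standardize x first and rescale y afterwards. The following is a minimal sketch of that idea; the parameter names and values (`mu_x`, `sd_x`, `mu_y`, `sd_y`) are hypothetical and not part of the original answer.

```r
set.seed(2018)
rho  <- 0.8
mu_x <- 50; sd_x <- 5     # hypothetical population mean and SD of x
mu_y <- 10; sd_y <- 2     # hypothetical target mean and SD of y
x <- rnorm(200, mu_x, sd_x)

z_x <- (x - mu_x) / sd_x                              # standardize with population parameters
z_y <- rnorm(length(x), rho * z_x, sqrt(1 - rho^2))   # conditional draw on the standard scale
y   <- mu_y + sd_y * z_y                              # rescale to the target mean and SD

cor(x, y)   # close to rho; varies from sample to sample
```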

Of course, x and y are samples, so you cannot expect the sample correlation $r$ to be exactly $\rho = 0.8.$ Large samples tend to have $r$ closer to $\rho$ than small ones.
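
If the sample correlation itself has to equal $\rho$, one common construction is to make the noise exactly uncorrelated with x in the sample and then standardize both pieces. The sketch below illustrates that idea under the assumption that an exact sample correlation is the goal; it is not claimed to match the linked answer line for line, and the variable names are made up for the example.

```r
set.seed(2018)
rho <- 0.8
x   <- rnorm(200, 50, 1)

e      <- rnorm(length(x))          # fresh noise, drawn independently of x
e_perp <- residuals(lm(e ~ x))      # part of e with zero sample correlation with x

z_x <- as.numeric(scale(x))         # sample mean 0, sample SD 1
z_e <- as.numeric(scale(e_perp))    # sample mean 0, sample SD 1

y <- rho * z_x + sqrt(1 - rho^2) * z_e

cor(x, y)   # equals rho up to floating-point error
```

Because the sample moments are forced to hit their targets, the resulting pairs are no longer a plain random draw from the bivariate normal distribution, so this suits cases where the sample statistic, rather than the population parameter, is what needs to be fixed.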

BruceET

Would you have the derivation of this bivariate relationship? – Xavier Bourret Sicotte Sep 18 '18 at 17:44
Wanted to link to @Glen_b's [Answer](https://stats.stackexchange.com/questions/111865/tool-for-generating-correlated-data-sets), now noted above, which I had seen before; but couldn't find it at that time. **_Recommend you use that._** // Otherwise, note standard result: If $X,Y$ bivar norm with corr $\rho,$ $\mu_X=\mu_Y=0,$ and $\sigma_X = \sigma_Y = 1,$ then marginals are std norm and cond'ls are $X|Y=y \sim \mathsf{Norm}(\rho y, \sqrt{1-\rho^2})$ and $Y|X=x \sim \mathsf{Norm}(\rho x, \sqrt{1-\rho^2}).$ Look at math stat book for whole story. – BruceET Sep 18 '18 at 19:49
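
A short sketch of where the conditional distribution comes from, assuming $X$ and $Z$ are independent standard normal variables ($Z$ is an auxiliary variable introduced here only for illustration):

$$
\begin{aligned}
Y &= \rho X + \sqrt{1-\rho^2}\, Z, \\
E(Y) &= 0, \qquad \operatorname{Var}(Y) = \rho^2 + (1-\rho^2) = 1, \\
\operatorname{Cov}(X, Y) &= \rho \operatorname{Var}(X) = \rho, \\
Y \mid X = x &= \rho x + \sqrt{1-\rho^2}\, Z \;\sim\; \mathsf{Norm}\!\left(\rho x, \sqrt{1-\rho^2}\right).
\end{aligned}
$$

Since $(X, Y)$ is a linear transformation of independent normals, it is bivariate normal with standard normal marginals and correlation $\rho$. A bivariate normal distribution is determined by its means, variances, and correlation, so the same conditional distribution holds for any standard bivariate normal pair. As in the answer, the second parameter of $\mathsf{Norm}$ is the standard deviation.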