
I need to simulate three variables $A, B, P \sim N(0,1)$ such that the Pearson correlations $r_{AB}=\operatorname{cor}(A,B)$ and $r_{BP}=\operatorname{cor}(B,P)$ are given. I need to repeat the simulation multiple times in order to generate several values for $r_{AP}=\operatorname{cor}(A,P)$ and to observe a sample distribution for it; that is why I am not fixing it. We want to see whether, under the simple assumption that all the variables are $\sim N(0,1)$, we can somehow "estimate" $r_{AP}$ given the other two correlations.

Questions concerning the simulation of variables with given correlations have already been addressed multiple times, but none of the suggested methods works well for me, because:

  1. To use the Cholesky decomposition of the correlation matrix (as explained here) I have to fix the correlation $r_{AP}$ as well, which is exactly what I don't want to do (see the sketch after this list);
  2. If instead I use a pairwise correlation simulation (as in here), you can analytically show that the expected value of the sample distribution for $r_{AP}$ is exactly $r_{AB} \cdot r_{BP}$, which takes away any predictive power of the simulation.
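
For reference on point 1, here is a minimal sketch of the Cholesky approach (Python, with purely illustrative numbers); the point is that a value for $r_{AP}$ has to be supplied up front just to build the correlation matrix:

```python
import numpy as np

# Cholesky-based simulation: the full 3x3 correlation matrix is required,
# so a value for r_AP must be chosen in advance (here an arbitrary 0.1).
r_AB, r_BP, r_AP = 0.5, 0.2, 0.1   # illustrative values only

R = np.array([[1.0,  r_AB, r_AP],
              [r_AB, 1.0,  r_BP],
              [r_AP, r_BP, 1.0]])

L = np.linalg.cholesky(R)          # fails if R is not positive definite
Z = np.random.standard_normal((10_000, 3))
A, B, P = (Z @ L.T).T              # columns are N(0,1) with correlation matrix R
```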

Does anybody have an idea of how to implement such a simulation? I am not familiar with copulas, but I have the feeling they could be a starting point.

**EDIT**

I have been asked several times to clarify my question. I hope this works.

The question is whether, just by supposing that you have $A, B, P \sim N(0,1)$ distributed random variables with given correlations for two of the three pairs (not all three), and no more information at all, you can derive a distribution for the missing correlation. I do not want a sample distribution for the random variables themselves, but for the missing correlation (if there is one at all; I am not even sure about this point). Here is the pseudo-code:

1. Generate A,B,P ~ N(0,1) s.t. cor(A,B) and cor(B,P) are given
2. Calculate cor(A,P)
3. Repeat 1. and 2. several times (say, 500) and get a distribution for cor(A,P).

The problem is that the ways I have been simulating until now (see the two links above) introduce some structure into the data (in particular, $P$ and $A$ become linearly dependent) and practically nullify the predictive power I am hoping to get from the simulation. But there might be no way at all to create artificial data with "partial structure" (distribution and two pairwise correlations assigned) while still hoping for the third correlation to be, to a certain extent, independent of this structure.
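
To make this concrete, here is a minimal sketch of steps 1–3 under the pairwise construction from the second link above (Python; $r_{AB}=0.5$ and $r_{BP}=0.2$ are just example values):

```python
import numpy as np

rng = np.random.default_rng(0)
r_AB, r_BP = 0.5, 0.2          # the two given correlations (example values)
n, n_rep = 1_000, 500          # sample size per repetition, number of repetitions

r_AP_samples = []
for _ in range(n_rep):
    # Step 1: pairwise construction -- A and P are each built from B plus noise,
    # which is exactly the "structure" described above.
    B = rng.standard_normal(n)
    A = r_AB * B + np.sqrt(1 - r_AB**2) * rng.standard_normal(n)
    P = r_BP * B + np.sqrt(1 - r_BP**2) * rng.standard_normal(n)
    # Step 2: record the sample correlation between A and P
    r_AP_samples.append(np.corrcoef(A, P)[0, 1])

# Step 3: the resulting distribution is centred near r_AB * r_BP
print(np.mean(r_AP_samples), r_AB * r_BP)
```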

jeiroje
  • Several of your premises / assertions are incorrect. For example, (i) you state you can analytically show that "$r_{AP}=r_{AB} \cdot r_{BP}$", but this claim is not true in general, only under particular conditions; (ii) you can fix a population correlation and still "observe a sample distribution". You seem to have a number of misconceptions/misunderstandings. Could you step back from your attempted solutions and explain more clearly what you need to achieve? – Glen_b Jun 11 '15 at 10:56
  • Hi @Glen_b, I updated the question, thanks for pointing it out. Regarding (i): could you please send me a link to that? I used basic probability definition for proving that, but I might have overlooked something. Regarding (ii): I meant the sample distribution of the correlations themselves. I updated it now, it should be clearer. – jeiroje Jun 11 '15 at 13:05
  • In respect of (i), why not give your proof? Note that correlation is not probability; Explicit bounds for the correlation in terms of the other two (and another quantity the bounds depend on) are given [here](http://stats.stackexchange.com/questions/122888/how-to-infer-correlations-from-correlations/124909#124909). In respect of (ii) my comment seems to be unaffected by your change, it still seems to apply -- you can fix a population correlation and the sample correlation still varies around it. – Glen_b Jun 11 '15 at 15:03
  • If you don't want to fix $r_{AP}$, the joint distribution of $A$, $B$, and $P$ is not completely specified. Do you have some additional constraints that might fill the gap? @Glen_b, to be fair, the poster states that this can be analytically shown for a specific simulation approach described under a link, not necessarily in general. – A. Donda Jun 11 '15 at 15:54
  • @A.Donda yes, you're correct -- but then one must wonder what relevance that simulation has to the actual problem the OP has; I'd like it to be clearer. – Glen_b Jun 11 '15 at 16:13
  • @Glen_b, agreed. – A. Donda Jun 11 '15 at 16:20
  • @Glen_b now I see what you mean. Yes, you are absolutely right, in general that's the furthest you can go with correlations. But if you simulate A, B and P such that B~N(0,1), $A = r_{AB}B + \sqrt{1-r_{AB}^2}\,a$ with a~N(0,1), and P in a similar way as A (as stated in the linked page), then A and P are linked by a somewhat linear relationship (+noise) and you can actually do the computation. – jeiroje Jun 12 '15 at 18:28
  • @A.Donda I do not have any other constraint, as we do not have much information concerning the real mechanisms underlying our data. That's why we need to simulate and cannot apply any theoretical arguments. – jeiroje Jun 12 '15 at 18:28
  • jeiroje - certainly if you add some additional structure that's not in your question (as with the simulation), that alters the things that can happen. I'm still not clear on which situation you really want to know about. – Glen_b Jun 13 '15 at 00:38
  • @Glen_b the additional structure was actually linked in the post; nevertheless, I have now made it more explicit. – jeiroje Jun 13 '15 at 14:47
  • Yes it was in the linked page -- but it wasn't clear that was the intended structure you wanted to ask about. – Glen_b Jun 13 '15 at 14:57

1 Answer


Maybe this little "proof" clears some things up for you. I suspect that you do not quite understand the correlation coefficient entirely, but that's just my guess and I don't want to offend you.

First let me remind you that the correlation coefficient is a measure of the linear dependence between two random variables X and Y.

So let's look at the formulas for the covariance and the correlation coefficient:

$Cov(X,Y) = E[(X-E[X])(Y-E[Y])]$

$r_{XY} = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}}$

Since X,Y ~ N(0,1) we get: $Cov(X,Y) = r_{XY}$

So now let's get to linear regression:

$Y = \beta_{0} + \beta_{1}X + \epsilon$

Notice that the intercept $\beta_{0}$ drops out because X,Y ~ N(0,1) (both means are zero), so we get:

$Y = \beta_{1}X + \epsilon$

Now note that the least-squares slope $\beta_{1}$ is given by $r_{XY} \cdot \frac{\sqrt{Var(Y)}}{\sqrt{Var(X)}}$, so since X,Y ~ N(0,1):

$\beta_{1} = r_{XY} = Cov(X,Y)$
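
As a quick numerical sanity check of this identity, here is a minimal sketch in Python (the value 0.6 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(100_000)
Y = 0.6 * X + np.sqrt(1 - 0.6**2) * rng.standard_normal(100_000)  # r_XY close to 0.6

slope = np.polyfit(X, Y, 1)[0]      # least-squares slope beta_1
r_XY = np.corrcoef(X, Y)[0, 1]      # sample correlation
print(slope, r_XY)                  # both close to 0.6
```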

So now let's move on to your example:

What you want to fix is:

$A = r_{AB}B + \epsilon_{AB}$

$P = r_{BP}B + \epsilon_{BP}$

What you want to "estimate" is

$r_{AP}$, which is equal to Cov(A,P) since A,P ~ N(0,1) as shown before.

The Covariance formula can be rewritten as:

$Cov(X,Y) = E[XY] - E[X]E[Y]$

So we get $Cov(A,P) = E[AP] - \underbrace{E[A]}_{0}\underbrace{E[P]}_{0}$, again since A,P ~ N(0,1), hence

Cov(A,P) = E[AP]

Let's plug our two equations for A and P in there:

$E[AP] = E[(r_{AB}B + \epsilon_{AB})(r_{BP}B + \epsilon_{BP})]$

$E[AP] = E[r_{AB}B^{2}r_{BP}] + \underbrace{E[r_{AB}B\epsilon_{BP}]}_{0} + \underbrace{E[\epsilon_{AB}Br_{BP}]}_{0} + \underbrace{E[\epsilon_{AB}\epsilon_{BP}]}_{0}$, since the residuals $\epsilon$ have mean zero, are uncorrelated with $B$, and are assumed uncorrelated with each other (a crucial regression assumption)

$E[AP] = r_{AB}r_{BP}E[B^{2}]$, because $r_{AB}$ and $r_{BP}$ are constants and can be pulled out of the expectation

$E[AP] = r_{AB}r_{BP}\underbrace{E[B^{2}]}_{1}$, because B ~ N(0,1) implies $E[B^{2}] = Var(B) + E[B]^{2} = 1$

So we finally get $E[AP] = r_{AB}r_{BP}$, which, since $r_{AP} = Cov(A,P) = E[AP]$, can be rewritten as:

$r_{AP} = r_{AB}r_{BP}$

So now you should also get an idea of when this does not hold true: for example, if the residuals $\epsilon$ are correlated with each other (in reality they are almost never entirely uncorrelated).
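
To illustrate that last point, here is a small sketch (Python, with hypothetical numbers) in which the two residuals are themselves correlated, so the sample correlation of A and P drifts away from $r_{AB}r_{BP}$:

```python
import numpy as np

rng = np.random.default_rng(2)
r_AB, r_BP, rho_eps = 0.5, 0.2, 0.6     # rho_eps: correlation between the residuals
n = 100_000

B = rng.standard_normal(n)
# Residuals that are standard normal, correlated with each other but not with B
e1 = rng.standard_normal(n)
e2 = rho_eps * e1 + np.sqrt(1 - rho_eps**2) * rng.standard_normal(n)

A = r_AB * B + np.sqrt(1 - r_AB**2) * e1
P = r_BP * B + np.sqrt(1 - r_BP**2) * e2

print(np.corrcoef(A, P)[0, 1])   # ~ r_AB*r_BP + rho_eps*sqrt((1-r_AB^2)(1-r_BP^2)), ~0.61 here
print(r_AB * r_BP)               # 0.10 -- the prediction under uncorrelated residuals
```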

Maybe @Glen_b♦ can check this answer for mistakes or expand on situations where this does not hold true, because he has far more knowledge than I have.

jannic
  • thanks for the answer. Unfortunately the output is not much different from mine: the product $r_{AB} \cdot r_{BP}$ is still the best estimator for the expected value of $r_{AP}$, even though in this case I cannot exactly prove why (I think the argument is still basically the linear relationship between A and P, I just don't know the details) – jeiroje Jun 14 '15 at 09:10
  • So do you want to prove it? I think I could write that proof down for you. Let me try to explain what I understand you want to do, just to be sure. You stated in your edit: 1. Generate A,B,P ~ N(0,1) s.t. cor(A,B) and cor(B,P) are given. Do you want cor(A,B) and cor(B,P) to be empirical (exact) correlations like I set them in the previous sample code, so that cor(A,B) = 0.5 and cor(B,P) = 0.2 in every new repetition? Or should they be sample correlations which are ~0.5 and ~0.2? – jannic Jun 14 '15 at 09:58
  • Also does this related question help you? http://stats.stackexchange.com/questions/5747/if-a-and-b-are-correlated-with-c-why-are-a-and-b-not-necessarily-correlated – jannic Jun 14 '15 at 09:59
  • I just updated the answer, hope this clears some things up for you; maybe @Glen_b♦ could expand further on that? – jannic Jun 15 '15 at 14:29
  • Since you want to fix the correlation coefficients $r_{AB}$ and $r_{BP}$, A and P can always be expressed as a linear combination of B. What do you mean by "another way"? Be aware that however you simulate linearly correlated variables, whether by Cholesky, eigendecomposition, linear combination, angles between two vectors, etc., the outcome will always be linearly dependent variables, and you can always write them as linear combinations of each other. Don't think of the correlation coefficient as a magic number which appears from somewhere; I have shown you above its relationship to linear regression. – jannic Jun 16 '15 at 21:00
  • Yes, but this is a _simulation_ problem, as we all know from real data that the relationship between correlations is anything but straightforward. I was wondering if there is a way to simulate A and P so that they maintain a certain degree of freedom from each other and their correlation is not directly computable via theoretical arguments (btw, your proof is also valid without the linear regression, if you simply apply the definitions from probability theory). – jeiroje Jun 19 '15 at 09:30
  • Could you provide that proof with definitions from probability theory in your question? I'm sure it would help people to get an idea what you are asking, also I'm very interested in it! – jannic Jun 27 '15 at 15:22
  • sorry for the delay in accepting your answer. Yes, with a proof similar to the one you showed above, and by speaking with some statisticians, we realized that there was no way we could simulate the three variables without artificially introducing a linear dependence. We had to proceed in a different way, but it was good to see that it was not a poorly written simulation but an actual theoretical limitation. Thanks again! – jeiroje Jul 18 '16 at 07:48