
I have three variables, A, B, and C. It seems obvious that if X = A-B and Y = C-B, there should be a correlation between X and Y. When I've done this in matlab with random numbers, that does seem to be the case, with a mean r of about 0.5. The fact that there is a correlation makes intuitive sense, because X and Y share the variance in B; but when I've tried to figure out why it's 0.5, I've had less luck. Any hints as to why this is the case would be greatly appreciated.
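
Roughly, the simulation is the following (sketched here in R rather than matlab for illustration; the vector length of 100 is an arbitrary choice):

rs <- replicate(1000, {
  A <- rnorm(100)  # three independent mean-0, sd-1 vectors
  B <- rnorm(100)
  C <- rnorm(100)
  cor(A - B, C - B)
})
mean(rs)  # comes out around 0.5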

  • Have you tried explicitly calculating $$\text{Cor}(X,Y)=\text{Cor}(A-B,C-B)=...$$? –  Sep 05 '13 at 19:21
  • Not sure if I follow your question. I've calculated three vectors of normally distributed random numbers, with a mean of 0 and std of 1, and calculated the equivalent of x and y. Having done that 1000 times, the mean r value is .4992. I've also done similar calculations with different means and stds (but not the extreme values Peter lists below). – user1589483 Sep 05 '13 at 19:33
  • You said "This makes intuitive sense, but when I've tried to figure out why this is the case, I've had less luck," so I was asking if you have actually done the paper-and-pencil calculations to see why the answer is the answer. –  Sep 05 '13 at 19:36
  • Other than running the correlations with random numbers, I haven't had luck trying to do any pencil and paper work with it. I've tried to come up with some way of looking at it with the equations for correlations, but perhaps the reason that I haven't had any luck is because of the point Peter made below. – user1589483 Sep 05 '13 at 19:44
  • Correlation is scaled covariance, which is easier to work with: $$ {\rm cov}(A-B,C-B) = {\rm cov}(A,C) - {\rm cov}(A,B) - {\rm cov}(B,C) + {\rm var}(B) $$ Lots of different things can happen. E.g., if your variables are standardized, $A,C$ are uncorrelated, and $B,C$ and $A,B$ each have a correlation of .5 (yes, that's a positive definite covariance matrix), then ${\rm cov}(A-B,C-B) = 0$. You can also make them nearly perfectly correlated when $A$ and $C$ are independent of $B$ and $B$ is on a much larger scale than $A$ and $C$... As is, this question seems too broad. What exactly are you asking? – Macro Sep 05 '13 at 20:01
  • Supposing that the SDs of the three variables are on the same scale, why is it that r is about .5? – user1589483 Sep 05 '13 at 20:15
  • Check out my solution below –  Sep 05 '13 at 20:23
  • If you are generating independent standardized variables then, using the formula from my other comment, ${\rm cov}(A-B,C-B) = 1$. Correlation is defined as the covariance divided by the product of the standard deviations of the constituent variables. Each of $A-B$ and $C-B$ has standard deviation $\sqrt{2}$ (see, e.g., [this thread](http://stats.stackexchange.com/questions/31177/does-the-variance-of-a-sum-equal-the-sum-of-the-variances)). Therefore the correlation is $1/(\sqrt{2} \cdot \sqrt{2}) = 1/2$. I guess that's your answer. – Macro Sep 05 '13 at 20:24
  • Suppose (perhaps unbeknownst to us) originally $A=X+B$ and $C=Y+B$ where $X,Y,B$ are *independent.* Then $X=A-B$ and $Y=C-B$ are as stipulated in the question but (assuming neither is constant) their correlation must be *zero* (not $1/2$) because $X$ and $Y$ are independent. Real insight into this question is afforded by pursuing @Macro's approach and computing $\text{Cov}(A-B,C-B) = \text{Cov}(A,C) - \text{Cov}(A,B) - \text{Cov}(B,C) + \text{Var}(B)$, etc. – whuber Sep 05 '13 at 20:31

4 Answers


So, calculating this by hand, we have the following (I will assume $A$, $B$, and $C$ are independent and, for the final step, that each has variance 1):

\begin{align*}
\text{Cor}(X,Y) &= \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}\\
&= \frac{\text{Cov}(A-B,\,C-B)}{\sqrt{\text{Var}(A-B)\,\text{Var}(C-B)}}\\
&= \frac{E[(A-B)(C-B)] - E[A-B]\,E[C-B]}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{E[AC - AB - BC + B^2] - (E[A]-E[B])(E[C]-E[B])}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{E[AC] - E[AB] - E[BC] + E[B^2] - (E[A]-E[B])(E[C]-E[B])}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{E[A]E[C] - E[A]E[B] - E[B]E[C] + E[B^2] - (E[A]-E[B])(E[C]-E[B])}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{E[B^2] - (E[B])^2}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{\text{Var}(B)}{\sqrt{[\text{Var}(A)+\text{Var}(B)]\times[\text{Var}(C)+\text{Var}(B)]}}\\
&= \frac{1}{\sqrt{2\times 2}} = \frac{1}{2}
\end{align*}
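
A quick numerical check of this result (a sketch, not part of the original derivation; the standard normal draws are one choice, any independent draws with equal variance will do):

set.seed(42)
A <- rnorm(1e5); B <- rnorm(1e5); C <- rnorm(1e5)
cor(A - B, C - B)  # approximately 0.5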

  • +1 for the effort! You know you can use existing properties of covariance in your answers, right? ;) BTW, normality was not used anywhere in your answer. – Macro Sep 05 '13 at 20:27
  • Oh yes I do, but I wanted to make sure the OP understood by doing everything the long way, from the definitions. –  Sep 05 '13 at 20:29
  • Did I need to use normality somewhere? –  Sep 05 '13 at 20:36
  • No but you did specify that $A,B,C$ were normally distributed. – Macro Sep 05 '13 at 20:36
  • Oh, that was just to show why I plugged in 0 and 1 where I did. I.e., I wanted to give a distribution to A, B, and C so I could show what the expectations and variances were equal to, etc. I just chose standard normals because most of the `R` code that came before used that. Could have been any distribution really. –  Sep 05 '13 at 20:38
  • No biggie. Just pointing out that it seemed unnecessarily restrictive. The OP only specified that the variables were on the same scale (e.g. standardized). – Macro Sep 05 '13 at 20:41
  • I'd remove the normality. I think it confuses the issue – Glen_b Sep 06 '13 at 01:43
  • How would removing the normality influence things? – user1589483 Sep 06 '13 at 03:01
  • I got rid of the normality stuff. –  Sep 06 '13 at 04:41
  • @BabakP, this derivation still assumes $E(A)=E(B)=E(C)=0$ and that all variances are equal, but those assumptions aren't stated anywhere. – Macro Sep 06 '13 at 15:03
  • I did assume all variances are equal, but I did not assume E(A) = E(B) = E(C) = 0. They all cancel out in the numerator if you factor out everything. Unless you are talking about something else @Macro. –  Sep 06 '13 at 15:07
  • You're right. I didn't read carefully enough! – Macro Sep 06 '13 at 15:09
  • Does the assumption that all variances are equal influence anything other than the last two steps, where you assume the variance is 1? – user1589483 Sep 06 '13 at 16:33
  • Yes. Assume the variances were all different numbers, plug them into the above formula and you won't get 1/2. –  Sep 06 '13 at 16:39
  • Ok, thanks. The main thing I was wondering is if the assumption of equal variances influences early aspects of the equation. – user1589483 Sep 06 '13 at 17:05
  • Probably not of interest, but I ran a simulation actually calculating the correlation in matlab and using the equation above (with the modifications that @John mentioned below for correlations between variables). Mean r for the actual correlations was .3923; mean r using the equation was .3922. – user1589483 Sep 06 '13 at 17:09
  • Awesome, good to hear it validates. –  Sep 06 '13 at 17:11

It will depend on what the variables are. E.g. (R code; it should be pretty clear, though):

set.seed(1234)
A <- rnorm(100)
B <- rnorm(100)
C <- rnorm(100)

cor(A-C, B-C) #0.40

A <- runif(100)
B <- runif(100)
C <- runif(100)

cor(A-C, B-C)  #0.54

A <- rnorm(100,1,10)
B <- rnorm(100,10,100)
C <- rnorm(100,1,200)

cor(A-C, B-C) #0.89

And, of course, if A and B are related, you would get other values.
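
As a sanity check on the last example (an added sketch using the covariance identity quoted in the comments above), note that rnorm takes standard deviations, so the variances are 100, 10000, and 40000, and C is the term shared by A-C and B-C:

vA <- 10^2; vB <- 100^2; vC <- 200^2
vC / sqrt((vA + vC) * (vB + vC))  # about 0.89, matching the simulation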

Peter Flom

Essentially what happens is that you have two sources of variance in each difference score, and one of those sources is shared between the two difference scores. When the initial random variables A, B, and C are all independent, subtracting B adds B's variance to both X and Y, and the proportion of variance that's shared is going to be your correlation. You had equal variances in each condition; that's why it's 0.5: half of the variance is shared.

Note that, if the correlation between A and B is 0, then var(A - B) == var(A) + var(B). All of the following expressions give similar results (I'll get to why they're not exact later).

a <- rnorm(10000, 0, 1)
b <- rnorm(10000, 0, 1)
var(a)           # ~1
var(b)           # ~1
var(a) + var(b)  # ~2
var(a + b)       # ~2, since a and b are (nearly) uncorrelated
var(a - b)       # ~2, same reason

So the X and Y variables you created both include the variance from B. Correlations are about shared variance, so your intuitions are correct.

I imagine that you've already done something like the following, but perhaps through brute force. I'm using R and the mvrnorm function because it allows me to set my initial correlations to exactly 0.

library(MASS) # so I can use mvrnorm and ensure 0 correlation

# The following is a covariance matrix with the variance of each condition
# on the diagonal and the covariances among the conditions off the diagonal.
sigma <- matrix(c(1.0, 0.0,  0.0,
                  0.0, 1.0,  0.0,
                  0.0, 0.0,  1.0), 3, byrow = TRUE)
mat <- mvrnorm(100, c(0,0,0), sigma, empirical = TRUE)
cor(mat)
a <- mat[,1]
b <- mat[,2]
c <- mat[,3]

x <- a - b
y <- c - b

cor(x, y)

You can see the final correlation of x and y is 0.5, just as you found. You can also predict in advance what would happen if you changed the variances of the conditions (keep in mind variance is what matters here, not standard deviation). Let's say the variance of b was raised from 1 to 2. Now the proportion of variance in x that comes from b is var(b) / (var(a) + var(b)), or 2/3. That's the same for y, so the geometric mean of the shared-variance proportions equals 2/3 as well, and that is the correlation. To generate the data you'd just change the original variance-covariance matrix I made above and then proceed as before.

# note the variance of the second condition is now 2
sigma <- matrix(c(1.0, 0.0,  0.0,
                  0.0, 2.0,  0.0,
                  0.0, 0.0,  1.0), 3, byrow = TRUE)
mat <- mvrnorm(100, c(0,0,0), sigma, empirical = TRUE)
a <- mat[,1]; b <- mat[,2]; c <- mat[,3]
x <- a - b
y <- c - b
cor(x, y)  # 2/3

And, indeed, what happens is you get a correlation of 2/3 (about 0.67). So it's easy to work out what the correlation should be based on the initial variances. If you've got correlations among a, b, and c, and with sampled data you almost always do, then it becomes a bit trickier. I imagine you came across this in your simulations, which is why what you found was only approximately 0.5: sometimes your random variables happened to be correlated. If I take my current variables, which are uncorrelated, then the following will all produce the same results.

var(a) + var(b)
var(a - b)
var(a + b)

They all show the sum of the variances. But the equations are different if there is a correlation. The general formula for var(a - b) is...

var(a) + var(b) - 2*cov(a,b)

which reduces to var(a) + var(b) when the correlation is 0. As an aside, to match var(a + b) you have to add the covariance term instead of subtracting it; a sketch demonstrating this follows.
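
Here's a short sketch of those identities with deliberately correlated variables (the 0.3 covariance is an arbitrary choice; mvrnorm with empirical = TRUE makes the sample moments exact):

sigma2 <- matrix(c(1.0, 0.3,
                   0.3, 1.0), 2, byrow = TRUE)
mat2 <- mvrnorm(100, c(0, 0), sigma2, empirical = TRUE)
a2 <- mat2[,1]
b2 <- mat2[,2]
var(a2) + var(b2) - 2*cov(a2, b2)  # 1.4, matches var(a2 - b2)
var(a2 - b2)                       # 1.4
var(a2) + var(b2) + 2*cov(a2, b2)  # 2.6, matches var(a2 + b2)
var(a2 + b2)                       # 2.6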

So, hopefully that explains it in enough detail that you could derive the expected correlation of X and Y based on the covariances among A, B, and C, and their respective variances.
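
Putting that all together, here is a sketch of a helper (the name expected_cor is mine, not from anything above) that computes the expected correlation of X = A - B and Y = C - B directly from the covariance matrix S of (A, B, C):

expected_cor <- function(S) {
  # cov(A-B, C-B) = cov(A,C) - cov(A,B) - cov(B,C) + var(B)
  num <- S[1,3] - S[1,2] - S[2,3] + S[2,2]
  # var(A-B) = var(A) + var(B) - 2*cov(A,B), and likewise for C-B
  den <- sqrt((S[1,1] + S[2,2] - 2*S[1,2]) * (S[3,3] + S[2,2] - 2*S[2,3]))
  num / den
}

expected_cor(diag(3))  # 0.5, the independent equal-variance case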

There's a graphical explanation using geometry that's very pretty but it escapes me right now. Also, this is excellent stuff to understand if you ever want to get a grasp on multiple regression (or even unbalanced ANOVAs).

John
  • Suppose the variance in A and C is not equal. How would one work out what the correlation would be based on the initial variance then? – user1589483 Sep 05 '13 at 21:51
  • sorry, I'm still not getting it yet. Maybe I've just been thinking about it too much today. – user1589483 Sep 06 '13 at 03:55
  • Here's a simple equation for when the initial correlations are 0. Let's say I made the initial variances a = 6, b = 2, c = 1.5. The proportion-of-shared-variance calculation is still the same: for x it is var(b)/(var(a) + var(b)), or 2/8, and for y it is 2/3.5. To find their correlation you take the geometric mean of those shared-variance proportions, sqrt(2/8 * 2/3.5); see the sketch after these comments for a quick check. (You'd have to correct the initial shared variances by the covariance if the a, b, c values were correlated.) – John Sep 06 '13 at 14:12
  • You could try it by applying my calculation to Peter Flom's last example and you'll get the correct answer (remember to change the SDs to variances). – John Sep 06 '13 at 14:20
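
A quick check of the numbers in the comment above (a sketch; independent a, b, c with variances 6, 2, and 1.5, variable names illustrative):

va <- 6; vb <- 2; vc <- 1.5
sqrt((vb / (va + vb)) * (vb / (vc + vb)))  # 0.378: geometric mean of the shared-variance proportions
vb / sqrt((va + vb) * (vc + vb))           # 0.378: same value via the covariance formula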

Nope. Let $B=C$, let $A=B$, or let $C=A$.
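
For instance, a quick R sketch of the $B=C$ case (sample size arbitrary):

A <- rnorm(100)
B <- rnorm(100)
C <- B             # degenerate case: B = C
cor(A - B, C - B)  # C - B is identically zero, so cor() warns that the sd is zero and returns NA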

AdamO
  • Could you further elaborate? –  Sep 05 '13 at 20:49
  • If $B=C$, then $Y=0$ and $\mbox{cor}(X, 0) = 0$. Stating the assumptions behind this problem *would* have given the OP some intuition about his question. – AdamO Sep 05 '13 at 20:56
  • If $Y=0$ then $\text{cor}(X,0)$ is undefined, not 0. –  Sep 05 '13 at 21:01
  • Sorry, I meant that to be `cov`. Nonetheless, stating some assumptions behind this will help everyone understand the problem. – AdamO Sep 05 '13 at 21:03
  • I agree with that statement –  Sep 05 '13 at 21:05
  • +1. Generally, I find simple counterexamples to be far more revealing than, say, ten lines of obscure equations or several dozen lines of code. – whuber Sep 06 '13 at 15:47