7

I start with a vector of random numbers sampled from a normal distribution:

R<-rnorm(100, mean=0, sd=30)

I would now like to create 3 variables that are correlated with each other with a pre-specified correlation. In addition I would like to have these three variables correlated with R with a pre-specified correlation.

For example A, B, C would have correlation 0.7 with each other, and A, B, C would have correlation 0.6 with R.

I.e. I am looking for the following covariance matrix:

    R   A   B   C
R   1  .6  .6  .6
A  .6   1  .7  .7
B  .6  .7   1  .7
C  .6  .7  .7   1

How can this be done in R?

Comp_Warrior
  • 2,075
  • 1
  • 20
  • 35
PLS
  • 173
  • 1
  • 4
  • 2
    This can be done using the `mvrnorm` function from the `MASS` package. The essential options are `mu` which is a vector of the means and `Sigma` which is the covariance matrix. – COOLSerdash Sep 05 '13 at 09:29
  • 1
    Along the same lines as COOLSerdash, there is a function called `rmvnorm()` in the `mvtnorm` package as well. –  Sep 05 '13 at 14:46
  • 1
    @COOLS (and BabakP): the challenge here is to create A, B, and C *conditional* upon R. `mvrnorm` will not (directly) do that. – whuber Sep 05 '13 at 16:17
  • @Whuber, if you create R (along with A,B,C) in the `rmvnorm()` command instead of using `rnorm()` you can do it directly. See my answer below. –  Sep 05 '13 at 18:20
  • 2
    I am making a distinction here @Babak. This question is worded in such a way that $R$ is given first and then $A,B,C$ need to be constructed ("I start with ... [and] now like to create..."). If that is indeed the case, then your solution does not work. If it is not the case, and the question merely is how to generate realizations from a multivariate normal, then you have answered it--and we would need to migrate the thread to SO where it will be on topic (as a pure programming question). PLS, we would appreciate your clarification of what you seek. – whuber Sep 05 '13 at 20:13

2 Answers2

2

You could do something like this:

library(mvtnorm)

x = 3.3
sig = matrix(c(30,x,x,x,x,1,.7,.7,x,.7,1,.7,x,.7,.7,1),nrow=4)

X = rmvnorm(100,mean=rep(0,4),sigma=sig,method="svd")
round(cor(X),2)

Y = rmvnorm(10000,mean=rep(0,4),sigma=sig,method="svd")
round(cor(Y),2)

Not the structure of the covariance matrix:

> sig
     [,1] [,2] [,3] [,4]
[1,] 30.0  3.3  3.3  3.3
[2,]  3.3  1.0  0.7  0.7
[3,]  3.3  0.7  1.0  0.7
[4,]  3.3  0.7  0.7  1.0

Now, as you can see with only 100 samples the calculated correlation is:

> X = rmvnorm(100,mean=rep(0,4),sigma=sig,method="svd")
> round(cor(X),2)
     [,1] [,2] [,3] [,4]
[1,] 1.00 0.70 0.59 0.67
[2,] 0.70 1.00 0.72 0.73
[3,] 0.59 0.72 1.00 0.74
[4,] 0.67 0.73 0.74 1.00

and with 100,000 samples the calculated correlation is:

> Y = rmvnorm(100000,mean=rep(0,4),sigma=sig,method="svd")
> round(cor(Y),2)
     [,1] [,2] [,3] [,4]
[1,]  1.0  0.6  0.6  0.6
[2,]  0.6  1.0  0.7  0.7
[3,]  0.6  0.7  1.0  0.7
[4,]  0.6  0.7  0.7  1.0

1

I don't know R, so here are verbal directions.

Step 1. Let $U$ be the upper Cholesky factor of the correlation matrix.
For the correlation matrix you give, $U$ is

1  .6  .6       .6
0  .8  .425     .425
0   0  .677772  .235145
0   0   0       .635674

Step 2. Let $X$ be an $n \times 4$ matrix in which the first column is your given vector R,
with the three other columns filled with random independent standard normals.

Step 3. Subtract its mean from each column of $X$, then get the QR decomposition of the result.
If $R_{1,1}$ is negative then change the signs of all four values in row 1 of $U$.

Step 4. Let $Y = Q\,U \sqrt{n-1}$. If all you want is standard scores then you're done. Otherwise you can replace the first column of $Y$ by your given vector R, then for each of the three other columns, multiply by the desired standard deviation and add the desired mean.

Ray Koopman
  • 2,143
  • 12
  • 6