2

Suppose we have real data $X$, $Y_1$, $Y_2$, $Y_3$. They are all of length $n$, positive, continuous, and are correlated with each other.

How can I simulate a continuous random variable $Z$ so that $cor(Z,X)>0$ but $cor(Z,Y_i)=0$ for all $i$?

If the data follows a multivariate normal distribution, we can work out the conditional (normal) distribution of $(Z|X,Y_1,Y_2,Y_3)$ with our desired property. But if the real data is not normal, what can I do?

(I tried proceeding as if they are normal but the resulting correlation matrix is no where near what I want)

dsaxton
  • 11,397
  • 1
  • 23
  • 45
qoheleth
  • 1,210
  • 1
  • 12
  • 25
  • 2
    Are you referring to the sample correlations (as I'd expect when referring to *data*, which you mention) or population correlations (as I'd expect when discussing variables, which you also mention, and use notation to suggest)? For example, is it the population correlation between $Z$ and $Y_i$ or the sample correlation (between a vector $z$ and a vector $y_i$) that you want to be zero? .... keeping in mind that the resulting sample correlation will differ from the population correlation. – Glen_b Nov 07 '16 at 01:03
  • 1
    Good point. Sample correlation is what I am after. – qoheleth Nov 07 '16 at 02:32
  • $Z=X$ non singular, $Y_i \sim$ iid anything is one such way. – AdamO Nov 09 '16 at 20:30
  • 1
    See also a related question http://stats.stackexchange.com/q/15011/3277 – ttnphns Nov 10 '16 at 07:51

1 Answers1

2

As Glen B. noted, it is unclear whether or not you are interested by the population or the sample.

At the population level, we can work with random variables $X,Y_1,Y_2,Y_3,Z$. Then if we define$$Z=a_0X+\sum_{i=1}^3 a_i Y_i$$ we have that \begin{align} \text{cov}(Z,X)&=a_0\text{var}(X)+\sum_{i=1}^3 a_i \text{cov}(X,Y_i)\\ \text{cov}(Z,Y_1)&=a_0\text{cov}(X,Y_1)+\sum_{i=1}^3 a_i \text{cov}(Y_1,Y_i)\\ \text{cov}(Z,Y_2)&=a_0\text{cov}(X,Y_2)+\sum_{i=1}^3 a_i \text{cov}(Y_2,Y_i)\\ \text{cov}(Z,Y_3)&=a_0\text{cov}(X,Y_3)+\sum_{i=1}^3 a_i \text{cov}(Y_3,Y_i) \end{align} which defines a system of four equations with four unknows, $a_0,a_1,a_2,a_3$, hence should have a single solution for almost every specification of the variances and covariances involved. Note that you can add an independent variable $W$ to the above combination $$Z=a_0X+\sum_{i=1}^3 a_i Y_i+W$$and recover the same property.

At the sample level, each term of variance or covariance in the rhs of the equations must be replaced by its empirical version. Here is an illustration:

first, simulating correlated vectors for illustration

x=rnorm(10)
y1=.1*x+rnorm(1)
y2=.2*x+rnorm(10)
y3=.3*x+rnorm(10)
y1=.1*x+rnorm(10)
#creating the data matrix
datum=cbind(x,y1,y2,y3)

...then computing covariances and solving equations, which are the two only lines you need for real data

coef=solve(cov(datum),c(.5,0,0,0))  #(.5,0,0,0,0) is an example
#deriving the proper linear combination
z=datum%*%coef

and finally checking all is fine

> cov(z,x)
     [,1]
[1,]  0.5
> cov(z,y1)
             [,1]
[1,] 6.911187e-17
> cov(z,y2)
             [,1]
[1,] 1.061494e-16
> cov(z,y3)
             [,1]
[1,] 1.054085e-16

And if you want to add noise to the linear combination, you simply need to account for it in the resolution, as follows

epsilon=rnorm(10) #any arbitrary random vector would do
coef=solve(cov(datum),as.vector(c(.5,0,0,0)-cov(epsilon,datum))
z=datum%*%coef+epsilon
Xi'an
  • 90,397
  • 9
  • 157
  • 575
  • This is great and almost there, except the desired property is at the *correlation* level instead of the covariance level. Do you think you can modify your post? – qoheleth Nov 16 '16 at 00:48
  • The same property holds for the correlations (zero versus non-zero). If you want specific correlations, you can adjust the variance of the added noise `epsilon` to reach those correlations. – Xi'an Nov 16 '16 at 07:52