As part of a simulation study, I would like to create multivariate data that follow a specific covariance matrix. In this study I would like to be able to show that my algorithm is able to find highly correlated data between two datasets. Also, I would like to include collinearity, in a form of defining the number of explanatory and response variables that correlate with each other.
Hereby the R code for the problem:
library(MASS)
nr_individuals = 2000 #nr of observations
high.nr.x = 4 #nr of highly correlated variables in X - (explonatory)
high.nr.y = 6 #nr of highly correlated variables in Y - (response)
highCorBetween = 20 #correlation between the highly correlated variables
lowCorX = 3 #correlation between variables in X
lowCorY = 1 #correlation between variables in Y
#SIMULATE CORRELATED COLUMNS
total.cols = high.nr.x+high.nr.y
#make correlation matrix
CorCols = diag(total.cols)
CorCols[,] = highCorBetween #insert highly correlated variables
CorCols[,] = CorCols[,] - sample(-3:3, total.cols^2, replace=TRUE) #make data a more realistic
#introduce negative correlation
negatives <- sample(length(CorCols), total.cols/2, replace=TRUE)
for(e in negatives)
CorCols[e] = CorCols[e] * -1
#put correlation within X in covariance matrrix
CorCols[1:high.nr.x,1:high.nr.x] = lowCorX
#correlation within Y
CorCols[(high.nr.x+1):(total.cols),(high.nr.x+1):(total.cols)] = lowCorY
#put diagonals on 100
diag(CorCols) = 100
#make covariance matrix symmetric
CorCols.u <- CorCols
CorCols.u[upper.tri(CorCols, F)] = 0
CorCols.l <- t(CorCols.u)
CorCols = CorCols.l + CorCols.u
diag(CorCols) = 100
#CorCols = CorCols/100
#generate a high correlated matrix
#CorCols.m = rmvnorm(n = nr_individuals, mean = rep(0,high.nr.x+high.nr.y), CorCols)
CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols)
#*****************************
CorCols
cov((CorCols.m))
This produces the following covariance matrix:
> CorCols
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 100 3 3 3 22 -17 18 23 18 -17
[2,] 3 100 3 3 22 -23 19 21 18 17
[3,] 3 3 100 3 22 20 22 23 22 17
[4,] 3 3 3 100 19 21 17 22 23 19
[5,] 22 22 22 19 100 1 1 1 1 1
[6,] -17 -23 20 21 1 100 1 1 1 1
[7,] 18 19 22 17 1 1 100 1 1 1
[8,] 23 21 23 22 1 1 1 100 1 1
[9,] 18 18 22 23 1 1 1 1 100 1
[10,] -17 17 17 19 1 1 1 1 1 100
>
> cov((CorCols.m))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 102.488311 6.478105 3.544439 1.235431 25.93322414 -19.06442718 15.5860641 20.1082330 19.4726431 -16.3888262
[2,] 6.478105 104.079934 5.955973 4.596520 23.09775179 -24.61767577 18.9976206 26.5669763 20.3534901 18.8990530
[3,] 3.544439 5.955973 108.301280 1.307604 23.65638341 24.55354464 24.8988210 22.7098385 23.1617868 16.9418012
[4,] 1.235431 4.596520 1.307604 98.914409 18.31694271 19.58798486 18.5691044 24.0341383 20.1565850 18.4877265
[5,] 25.933224 23.097752 23.656383 18.316943 102.52463872 0.08230676 2.3322417 2.9476010 3.2918859 1.5124856
[6,] -19.064427 -24.617676 24.553545 19.587985 0.08230676 100.48101959 2.1288953 0.2182186 2.9399330 1.3138298
[7,] 15.586064 18.997621 24.898821 18.569104 2.33224168 2.12889526 96.6045039 2.2321378 0.3256926 2.1856988
[8,] 20.108233 26.566976 22.709839 24.034138 2.94760099 0.21821861 2.2321378 102.9477339 2.0669838 0.5530647
[9,] 19.472643 20.353490 23.161787 20.156585 3.29188587 2.93993300 0.3256926 2.0669838 101.4802341 1.3955998
[10,] -16.388826 18.899053 16.941801 18.487726 1.51248555 1.31382981 2.1856988 0.5530647 1.3955998 98.7497249
However, if I change the high correlation to value .8, that is
highCorBetween = 80 #correlation between the highly correlated variables
I get the error message of "'Sigma' is not positive definite":
CorCols
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 100 3 3 3 83 79 83 79 79 77
[2,] 3 100 3 3 83 -81 77 78 77 82
[3,] 3 3 100 3 80 80 78 80 78 80
[4,] 3 3 3 100 83 80 79 79 79 82
[5,] 83 83 80 83 100 1 1 1 1 1
[6,] 79 -81 80 80 1 100 1 1 1 1
[7,] 83 77 78 79 1 1 100 1 1 1
[8,] 79 78 80 79 1 1 1 100 1 1
[9,] 79 77 78 79 1 1 1 1 100 1
[10,] 77 82 80 82 1 1 1 1 1 100
> CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols)
Error in mvrnorm(n = nr_individuals, mu = rep(0, high.nr.x + high.nr.y), :
'Sigma' is not positive definite
I tried to change the tolerance parameter in mvrnorm() function, for example tol=1, but that would only solve the problem of getting an error, the covariance matrix still wouldn't look like that resembles the model matrix (correlation in-between variables for both X and Y increase hugely):
> CorCols
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 100 3 3 3 83 79 83 79 79 77
[2,] 3 100 3 3 83 -81 77 78 77 82
[3,] 3 3 100 3 80 80 78 80 78 80
[4,] 3 3 3 100 83 80 79 79 79 82
[5,] 83 83 80 83 100 1 1 1 1 1
[6,] 79 -81 80 80 1 100 1 1 1 1
[7,] 83 77 78 79 1 1 100 1 1 1
[8,] 79 78 80 79 1 1 1 100 1 1
[9,] 79 77 78 79 1 1 1 1 100 1
[10,] 77 82 80 82 1 1 1 1 1 100
> CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols, tol =1)
> #*****************************
>
> cov((CorCols.m))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 129.92109 25.83637 40.40527 35.12365 50.61760 54.51044 48.49376 49.51910 46.92675 45.92999
[2,] 25.83637 135.09280 26.79751 24.19305 54.26442 -77.79020 47.86738 52.82103 55.25732 54.82464
[3,] 40.40527 26.79751 140.75243 44.46343 53.67213 61.96498 53.79998 52.06622 52.12298 49.71284
[4,] 35.12365 24.19305 44.46343 135.61404 51.25497 60.02818 49.28677 47.03124 48.73891 52.63492
[5,] 50.61760 54.26442 53.67213 51.25497 129.96128 15.43909 19.99275 26.70164 26.35219 24.13483
[6,] 54.51044 -77.79020 61.96498 60.02818 15.43909 125.73316 16.46839 12.46468 10.06989 13.02532
[7,] 48.49376 47.86738 53.79998 49.28677 19.99275 16.46839 119.76993 28.25129 25.61649 25.96538
[8,] 49.51910 52.82103 52.06622 47.03124 26.70164 12.46468 28.25129 119.88476 23.82021 24.48806
[9,] 46.92675 55.25732 52.12298 48.73891 26.35219 10.06989 25.61649 23.82021 126.18663 24.85941
[10,] 45.92999 54.82464 49.71284 52.63492 24.13483 13.02532 25.96538 24.48806 24.85941 123.65103
Is there any workarounds for this problem? That is, how could I generate data that has a covariance matrix that follows closely the model matrix of a few highly correlated variables between X and Y yet having low correlation between variables in X and between variables in Y separately (such as the one denoted CorCols in the last code sample).
Any suggestions and references are welcome. Thank you!