3

As part of a simulation study, I would like to create multivariate data that follow a specific covariance matrix. In this study I would like to be able to show that my algorithm is able to find highly correlated data between two datasets. Also, I would like to include collinearity, in a form of defining the number of explanatory and response variables that correlate with each other.

Hereby the R code for the problem:

  library(MASS)


  nr_individuals  = 2000    #nr of observations
  high.nr.x       = 4       #nr of highly correlated variables in X - (explonatory)
  high.nr.y       = 6       #nr of highly correlated variables in Y - (response)
  highCorBetween  = 20      #correlation between the highly correlated variables
  lowCorX         = 3      #correlation between variables in X
  lowCorY         = 1       #correlation between variables in Y


  #SIMULATE CORRELATED COLUMNS
  total.cols = high.nr.x+high.nr.y

  #make correlation matrix
  CorCols = diag(total.cols)
  CorCols[,] =  highCorBetween    #insert highly correlated variables
  CorCols[,] = CorCols[,] - sample(-3:3, total.cols^2, replace=TRUE)  #make data a more realistic

  #introduce negative correlation
  negatives <- sample(length(CorCols), total.cols/2, replace=TRUE)
  for(e in negatives)
    CorCols[e] = CorCols[e] * -1

  #put correlation within X in covariance matrrix
  CorCols[1:high.nr.x,1:high.nr.x] = lowCorX

  #correlation within Y
  CorCols[(high.nr.x+1):(total.cols),(high.nr.x+1):(total.cols)] = lowCorY

  #put diagonals on 100
  diag(CorCols) = 100

  #make covariance matrix symmetric
  CorCols.u <- CorCols
  CorCols.u[upper.tri(CorCols, F)] = 0
  CorCols.l <- t(CorCols.u)
  CorCols = CorCols.l + CorCols.u
  diag(CorCols) = 100

  #CorCols = CorCols/100

  #generate a high correlated matrix
  #CorCols.m = rmvnorm(n = nr_individuals, mean = rep(0,high.nr.x+high.nr.y), CorCols)
  CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols)
  #*****************************

  CorCols

  cov((CorCols.m))

This produces the following covariance matrix:

   >   CorCols
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  100    3    3    3   22  -17   18   23   18   -17
 [2,]    3  100    3    3   22  -23   19   21   18    17
 [3,]    3    3  100    3   22   20   22   23   22    17
 [4,]    3    3    3  100   19   21   17   22   23    19
 [5,]   22   22   22   19  100    1    1    1    1     1
 [6,]  -17  -23   20   21    1  100    1    1    1     1
 [7,]   18   19   22   17    1    1  100    1    1     1
 [8,]   23   21   23   22    1    1    1  100    1     1
 [9,]   18   18   22   23    1    1    1    1  100     1
[10,]  -17   17   17   19    1    1    1    1    1   100
>   
>   cov((CorCols.m))
            [,1]       [,2]       [,3]      [,4]         [,5]         [,6]       [,7]        [,8]        [,9]       [,10]
 [1,] 102.488311   6.478105   3.544439  1.235431  25.93322414 -19.06442718 15.5860641  20.1082330  19.4726431 -16.3888262
 [2,]   6.478105 104.079934   5.955973  4.596520  23.09775179 -24.61767577 18.9976206  26.5669763  20.3534901  18.8990530
 [3,]   3.544439   5.955973 108.301280  1.307604  23.65638341  24.55354464 24.8988210  22.7098385  23.1617868  16.9418012
 [4,]   1.235431   4.596520   1.307604 98.914409  18.31694271  19.58798486 18.5691044  24.0341383  20.1565850  18.4877265
 [5,]  25.933224  23.097752  23.656383 18.316943 102.52463872   0.08230676  2.3322417   2.9476010   3.2918859   1.5124856
 [6,] -19.064427 -24.617676  24.553545 19.587985   0.08230676 100.48101959  2.1288953   0.2182186   2.9399330   1.3138298
 [7,]  15.586064  18.997621  24.898821 18.569104   2.33224168   2.12889526 96.6045039   2.2321378   0.3256926   2.1856988
 [8,]  20.108233  26.566976  22.709839 24.034138   2.94760099   0.21821861  2.2321378 102.9477339   2.0669838   0.5530647
 [9,]  19.472643  20.353490  23.161787 20.156585   3.29188587   2.93993300  0.3256926   2.0669838 101.4802341   1.3955998
[10,] -16.388826  18.899053  16.941801 18.487726   1.51248555   1.31382981  2.1856988   0.5530647   1.3955998  98.7497249

However, if I change the high correlation to value .8, that is

highCorBetween  = 80      #correlation between the highly correlated variables

I get the error message of "'Sigma' is not positive definite":

CorCols
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  100    3    3    3   83   79   83   79   79    77
 [2,]    3  100    3    3   83  -81   77   78   77    82
 [3,]    3    3  100    3   80   80   78   80   78    80
 [4,]    3    3    3  100   83   80   79   79   79    82
 [5,]   83   83   80   83  100    1    1    1    1     1
 [6,]   79  -81   80   80    1  100    1    1    1     1
 [7,]   83   77   78   79    1    1  100    1    1     1
 [8,]   79   78   80   79    1    1    1  100    1     1
 [9,]   79   77   78   79    1    1    1    1  100     1
[10,]   77   82   80   82    1    1    1    1    1   100
> CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols)
Error in mvrnorm(n = nr_individuals, mu = rep(0, high.nr.x + high.nr.y),  : 
  'Sigma' is not positive definite

I tried to change the tolerance parameter in mvrnorm() function, for example tol=1, but that would only solve the problem of getting an error, the covariance matrix still wouldn't look like that resembles the model matrix (correlation in-between variables for both X and Y increase hugely):

> CorCols
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  100    3    3    3   83   79   83   79   79    77
 [2,]    3  100    3    3   83  -81   77   78   77    82
 [3,]    3    3  100    3   80   80   78   80   78    80
 [4,]    3    3    3  100   83   80   79   79   79    82
 [5,]   83   83   80   83  100    1    1    1    1     1
 [6,]   79  -81   80   80    1  100    1    1    1     1
 [7,]   83   77   78   79    1    1  100    1    1     1
 [8,]   79   78   80   79    1    1    1  100    1     1
 [9,]   79   77   78   79    1    1    1    1  100     1
[10,]   77   82   80   82    1    1    1    1    1   100
>   CorCols.m = mvrnorm(n = nr_individuals, mu = rep(0,high.nr.x+high.nr.y), CorCols, tol =1)
>   #*****************************
>   
>   cov((CorCols.m))
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
 [1,] 129.92109  25.83637  40.40527  35.12365  50.61760  54.51044  48.49376  49.51910  46.92675  45.92999
 [2,]  25.83637 135.09280  26.79751  24.19305  54.26442 -77.79020  47.86738  52.82103  55.25732  54.82464
 [3,]  40.40527  26.79751 140.75243  44.46343  53.67213  61.96498  53.79998  52.06622  52.12298  49.71284
 [4,]  35.12365  24.19305  44.46343 135.61404  51.25497  60.02818  49.28677  47.03124  48.73891  52.63492
 [5,]  50.61760  54.26442  53.67213  51.25497 129.96128  15.43909  19.99275  26.70164  26.35219  24.13483
 [6,]  54.51044 -77.79020  61.96498  60.02818  15.43909 125.73316  16.46839  12.46468  10.06989  13.02532
 [7,]  48.49376  47.86738  53.79998  49.28677  19.99275  16.46839 119.76993  28.25129  25.61649  25.96538
 [8,]  49.51910  52.82103  52.06622  47.03124  26.70164  12.46468  28.25129 119.88476  23.82021  24.48806
 [9,]  46.92675  55.25732  52.12298  48.73891  26.35219  10.06989  25.61649  23.82021 126.18663  24.85941
[10,]  45.92999  54.82464  49.71284  52.63492  24.13483  13.02532  25.96538  24.48806  24.85941 123.65103

Is there any workarounds for this problem? That is, how could I generate data that has a covariance matrix that follows closely the model matrix of a few highly correlated variables between X and Y yet having low correlation between variables in X and between variables in Y separately (such as the one denoted CorCols in the last code sample).

Any suggestions and references are welcome. Thank you!

01000001
  • 76
  • 6
  • 4
    Closely related: [Is every covariance matrix positive definite?](http://stats.stackexchange.com/q/56832/22228) (answer: every valid covariance matrix has to be positive *semi*-definite) and [Is a sample covariance matrix always symmetric and positive definite?](http://stats.stackexchange.com/q/52976/22228). Some matrices just can't be covariance matrices, which means you can't simulate data with that covariance structure. – Silverfish Mar 07 '16 at 09:30
  • 1
    `I encounter the problem of not positive definite matrices` Your second matrix (following these words) appears [_negatively_ definite](http://stats.stackexchange.com/q/69114/3277). I.e. it has some negative eigenvalues (and no zero eigenvalues). No real data (having no missings) can ever correspond to such a covariance matrix. Only positive (semi)definite cov matrix can have corresponding data. – ttnphns Mar 07 '16 at 09:33
  • 1
    @ttnphns: "Negative-definite" means that *all* eigenvalues are negative, not only some of them. – amoeba Mar 07 '16 at 10:01
  • @amoeba, eh... I don't remember. Maybe you are correct in terminology. I will check. It doesn't matter here. – ttnphns Mar 07 '16 at 10:10
  • Did you read the comments above, @01000001? Did they answer your question? – amoeba Jan 10 '17 at 09:51
  • 1
    @amoeba Thank you for your comment, now I have read the comments. It seams like I would have to accept the answer of Silverfish that it is not possible to generate data that would satisfy my predefined covariance matrix. Thank you all for the comments! – 01000001 Jan 10 '17 at 10:39

0 Answers0