33

For example, in R, the MASS::mvrnorm() function is useful for generating data to demonstrate various things in statistics. It takes a mandatory Sigma argument which is a symmetric matrix specifying the covariance matrix of the variables. How would I create a symmetric $n\times n$ matrix with arbitrary entries?

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
rsl
  • 845
  • 2
  • 9
  • 15
  • 5
    I think this question would benefit from being edited to focus on "how can I create an arbitrary covariance matrix" and less on the coding aspect. There is certainly an on-topic underlying statistical issue here, as demonstrated by the answer. – Silverfish May 31 '16 at 22:23
  • 3
    Related: [How to efficiently generate random positive-semidefinite correlation matrices?](http://stats.stackexchange.com/questions/2746) – amoeba Jun 02 '16 at 13:50

4 Answers4

41

I like to have control over the objects I create, even when they might be arbitrary.

Consider, then, that all possible $n\times n$ covariance matrices $\Sigma$ can be expressed in the form

$$\Sigma= P^\prime\ \text{Diagonal}(\sigma_1,\sigma_2,\ldots, \sigma_n)\ P$$

where $P$ is an orthogonal matrix and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$.

Geometrically this describes a covariance structure with a range of principal components of sizes $\sigma_i$. These components point in the directions of the rows of $P$. See the figures at Making sense of principal component analysis, eigenvectors & eigenvalues for examples with $n=3$. Setting the $\sigma_i$ will set the magnitudes of the covariances and their relative sizes, thereby determining any desired ellipsoidal shape. The rows of $P$ orient the axes of the shape as you prefer.

One algebraic and computing benefit of this approach is that when $\sigma_n \gt 0$, $\Sigma$ is readily inverted (which is a common operation on covariance matrices):

$$\Sigma^{-1} = P^\prime\ \text{Diagonal}(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_n)\ P.$$

Don't care about the directions, but only about the ranges of sizes of the the $\sigma_i$? That's fine: you can easily generate a random orthogonal matrix. Just wrap $n^2$ iid standard Normal values into a square matrix and then orthogonalize it. It will almost surely work (provided $n$ isn't huge). The QR decomposition will do that, as in this code

n <- 5
p <- qr.Q(qr(matrix(rnorm(n^2), n)))

This works because the $n$-variate multinormal distribution so generated is "elliptical": it is invariant under all rotations and reflections (through the origin). Thus, all orthogonal matrices are generated uniformly, as argued at How to generate uniformly distributed points on the surface of the 3-d unit sphere?.

A quick way to obtain $\Sigma$ from $P$ and the $\sigma_i$, once you have specified or created them, uses crossprod and exploits R's re-use of arrays in arithmetic operations, as in this example with $\sigma=(\sigma_1, \ldots, \sigma_5) = (5,4,3,2,1)$:

Sigma <- crossprod(p, p*(5:1))

As a check, the Singular Value decomposition should return both $\sigma$ and $P^\prime$. You may inspect it with the command

svd(Sigma)

The inverse of Sigma of course is obtained merely by changing the multiplication by $\sigma$ into a division:

Tau <- crossprod(p, p/(5:1))

You may verify this by viewing zapsmall(Sigma %*% Tau), which should be the $n\times n$ identity matrix. A generalized inverse (essential for regression calculations) is obtained by replacing any $\sigma_i \ne 0$ by $1/\sigma_i$, exactly as above, but keeping any zeros among the $\sigma_i$ as they were.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • It might help to demonstrate how to use the rows of $P$ to orient the axes as preferred. – gung - Reinstate Monica May 31 '16 at 23:33
  • 1
    Might be worth mentioning that the singular values in `svd(Sigma)` will be reordered -- that confused me for a minute. – FrankD Mar 28 '18 at 10:45
  • 1
    @whuber could you edit please the line of code to parametrize n: crossprod(p, p*(n:1)). In order to improve reproducibility, good answer! – Cristóbal Alcázar Feb 02 '20 at 15:54
  • @CristóbalAlcázar Thank you for your comment. Because that's a specific example and is not intended for general purposes, I think it's best to keep the values hard-coded as they are. – whuber Feb 02 '20 at 16:37
29

Create an $n\times n$ matrix $A$ with arbitrary values

and then use $\Sigma = A^T A$ as your covariance matrix.

For example

n <- 4  
A <- matrix(runif(n^2)*2-1, ncol=n) 
Sigma <- t(A) %*% A
Henry
  • 30,848
  • 1
  • 63
  • 107
  • Likewise, `Sigma – rsl May 31 '16 at 08:45
  • 12
    @MoazzemHossen: Your suggestion will produce a symmetric matrix, but it may not always be positive semidefinite (e.g. your suggestion could produce a matrix with negative eigenvalues) and so it may not be suitable as a covariance matrix – Henry May 31 '16 at 10:30
  • Yes, I noticed that R returns error in the event my suggested way produced unsuitable matrix. – rsl May 31 '16 at 18:32
  • 7
    Note that if you prefer a correlation matrix for better interpretability, there is the [?cov2cor](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html) function, which can be applied subsequently. – gung - Reinstate Monica May 31 '16 at 22:58
  • 1
    @B11b: You need your covariance matrix to be positive semi-definite. That would put some limits on the covariance values, not totally obvious ones when $n \gt 2$ – Henry Jan 11 '18 at 09:26
  • thank you for your reply @Henry. Please can you write what you said as a clear command in R? I will be happy. Thank you so much – 1190 Jan 11 '18 at 18:25
  • @B11b try the `matrixcalc` or `sfsmisc` or `corpcor` packages or some of the others mentioned in other answers to see if they have what you are looking for – Henry Jan 11 '18 at 19:19
8

You can simulate random positive definite matrices from the Wishart distribution using the function "rWishart" from stats (included in base)

n <- 4
rWishart(1,n,diag(n))
Carlos Llosa
  • 593
  • 4
  • 7
4

There is a package specifically for that, clusterGeneration (written among other by Harry Joe, a big name in that field).

There are two main functions:

  • genPositiveDefMat generate a covariance matrix, 4 different methods
  • rcorrmatrix : generate a correlation matrix

Quick example:

library(clusterGeneration)
#> Loading required package: MASS
genPositiveDefMat("unifcorrmat",dim=3)
#> $egvalues
#> [1] 15.408962  5.673916  1.228842
#> 
#> $Sigma
#>          [,1]     [,2]     [,3]
#> [1,] 6.714871 1.643449 6.530493
#> [2,] 1.643449 6.568033 2.312455
#> [3,] 6.530493 2.312455 9.028815
genPositiveDefMat("eigen",dim=3)
#> $egvalues
#> [1] 8.409136 4.076442 2.256715
#> 
#> $Sigma
#>            [,1]       [,2]      [,3]
#> [1,]  2.3217300 -0.1467812 0.5220522
#> [2,] -0.1467812  4.1126757 0.5049819
#> [3,]  0.5220522  0.5049819 8.3078880

Created on 2019-10-27 by the reprex package (v0.3.0)

Finally, note that an alternative approach is to do a first try from scratch, then use Matrix::nearPD() to make your matrix positive-definite.

Matifou
  • 2,699
  • 14
  • 25