Cholesky decomposition or alternative for negatively correlated data simulations

Question

I want to generate some signals that have a correlation distribution around a specific pre-defined correlation value (i.e., the distribution of the values of their correlation matrix is around a specific expected value (rho)).

For example, for rho = 0.5, if I want to build some signals X_synthetic that their correlation matrix is cor_matrix, then the values in np.corrcoef(X_synthetic) should all be around 0.5. So the histogram will be around 0.5 in that case.

Example to achieve this for a positive rho value:

import numpy as np

#desired expected rho (of the distribution of the corr matrix)
rho = 0.5

# desired correlation matrix
cor_matrix = np.ones((5,5))* rho
np.fill_diagonal(cor_matrix,1) # 1s in diagonal
print(cor_matrix)

# this is artificial case but it will result in the derired distribution.
array([[1. , 0.5, 0.5, 0.5, 0.5],
       [0.5, 1. , 0.5, 0.5, 0.5],
       [0.5, 0.5, 1. , 0.5, 0.5],
       [0.5, 0.5, 0.5, 1. , 0.5],
       [0.5, 0.5, 0.5, 0.5, 1. ]])


L = np.linalg.cholesky(cor_matrix)

# build some signals that will result in the desired correlation matrix
X_synthetic = L.dot(np.random.normal(0,1, (5,2000)))

# estimate their correlation matrix
np.corrcoef(X_synthetic)

array([[1.        , 0.50576661, 0.51472813, 0.47208374, 0.49260528],
       [0.50576661, 1.        , 0.4798111 , 0.48540114, 0.47225243],
       [0.51472813, 0.4798111 , 1.        , 0.4649033 , 0.4745259 ],
       [0.47208374, 0.48540114, 0.4649033 , 1.        , 0.50059795],
       [0.49260528, 0.47225243, 0.4745259 , 0.50059795, 1.        ]])

#* Very good approximation. All values are fluctuating around 0.5.
#* So the distribution of the correlation values of `X_synthetic` is around the expected value `0.5`.

Now, I want to do the same, but the values of np.corrcoef(X_synthetic) should all be around -0.3 so that the histogram would be around -0.3 in that case.

#desired expected rho (of the distribution of the corr matrix)
rho = -0.3

# desired correlation matrix
cor_matrix = np.ones((5,5))* rho
np.fill_diagonal(cor_matrix,1) # 1s in diagonal
print(cor_matrix)

array([[ 1. , -0.3, -0.3, -0.3, -0.3],
       [-0.3,  1. , -0.3, -0.3, -0.3],
       [-0.3, -0.3,  1. , -0.3, -0.3],
       [-0.3, -0.3, -0.3,  1. , -0.3],
       [-0.3, -0.3, -0.3, -0.3,  1. ]])


L = np.linalg.cholesky(cor_matrix) # fails

X_synthetic = L.dot(np.random.normal(0,1, (5,2000)))

The cholesky will fail and raise LinAlgError: Matrix is not positive definite.

I understand that this is not a realistic case, but in practice, I want to build cor_matrix in a way such as the X_synthetic signals would have a correlation matrix with values varying around -0.3, similarly to the case of 0.5 as shown above.

Just because signals are negatively correlated does not mean the correlation matrix is not positive definite! See https://en.wikipedia.org/wiki/Definiteness_of_a_matrix for the definition of "positive definite". In your case, you have created an impossible correlation matrix; no data can actually have that correlation structure. See https://stats.stackexchange.com/questions/69114/why-does-correlation-matrix-need-to-be-positive-semi-definite-and-what-does-it-m?noredirect=1&lq=1 for more. — jbowman, Feb 21 '20 at 20:49
Given your specified covariance matrix, the sum of the three random variables would have variance -1 (which is obviously impossible). So you have just proposed an impossible covariance matrix -- no random variables could have this covariance. — josliber, Feb 21 '20 at 20:53
Thanks for the replies. I have edited my question. The whole point is that I want to generate some signals that have an pre-defined expected correlation value, i.e., the mean of the histogram of their correlation matrix is around a given value. I can do that for positive values but cannot find a way for negative values. — seralouk, Feb 21 '20 at 21:22
@makis What you are asking us to simulate is impossible -- a set of 5 random variables cannot have correlation -0.5 between each pair. That is what the error means -- you're asking for an impossible correlation matrix. It's not a numerical or computational issue -- you are asking for something that is mathematically impossible for any set of 5 random variables. Try instead with -0.1 instead of -0.5 for the off-diagonal entries -- you will see that this is possible and that your code works fine. — josliber, Feb 21 '20 at 21:44
By the way this doesn't only happen with negative correlations. An example would be array([[1.0, 0.9, 0.9], [0.9, 1.0, 0.1], [0.9, 0.1, 1.0]]). If A and B have correlation 0.9 and A and C have correlation 0.9, then it's impossible for B and C to have correlation of only 0.1 -- it must be higher. In short it takes more than correlations between -1.0 and 1.0 for a correlation matrix to be possible -- "Matrix is not positive definite" means your correlation matrix is not possible. — josliber, Feb 21 '20 at 21:50
Wonderfull explanation thank you. I tried with `-0.1` and even `-0.25` and indeed the correlation matrix is valid (positive definite). Any tip/idea how to achive this for `-0.3`. Maybe indead of putting (everywhere excpet the diagonal) random values within a range e.g. `-0.25` to `0.32` and make it symmetric ? — seralouk, Feb 21 '20 at 22:15
@makis To me, the request is somewhat odd. Think about random variables A, B, and C. If A and B have correlation -0.5 and A and C have correlation -0.5, then A is negatively associated with both, and I would guess that B and C would have a *positive* correlation with each other. So the requirement that all pairs are negatively associated seems weird (and, as we've seen, is impossible if you want them to be highly negatively correlated). Perhaps you could edit the question to describe why you think you have RVs with all pairs having strong negative correlation. — josliber, Feb 21 '20 at 22:46
I edited by adding more information. I do not want all values in the correlation matrix to be the same -- I want the distribution of the values of the correlation matrix to be around a specific value. The only way to achieve this so far is by doing as shown in my code. — seralouk, Feb 21 '20 at 23:21
@makis OK; sadly there's no possible answer other than "that's not mathematically possible." — josliber, Feb 21 '20 at 23:37
I see. Thanks for your answer. I will need to reformulate my thoughts about this. — seralouk, Feb 21 '20 at 23:51

josliber · Accepted Answer · 2020-02-22T03:31:12.810

Ultimately you are asking how to sample from a $d$-dimensional random variable with the following correlation matrix (for some fixed $\rho$):

$$ V = \left[\begin{array}{cccc} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \cdots & \cdots & \cdots & \cdots \\ \rho & \rho & \cdots & 1\end{array}\right] $$

As you note, your sample correlation will not exactly equal $V$, but it should be close for a sufficiently large sample size.

Any correlation matrix $V$ must be positive semidefinite, and your approach to sampling from a multivariate normal distribution via Cholesky decomposition requires $V$ to be positive definite. A simple application of the matrix determinant lemma tells us that $V$ has determinant $(1+\rho(d-1))(1-\rho)^{d-1}$; Sylvester's criterion tells us that $V$ is positive definite for $\frac{-1}{d-1} < \rho < 1$ and positive semidefinite for $\frac{-1}{d-1} \leq \rho \leq 1$.

The cause for your error is clear -- you are attempting to define $V$ with $d=5$ and $\rho=-0.3$, but this yields a matrix $V$ that is not positive semidefinite (and therefore not a valid correlation matrix). No 5-dimensional random variable has pairwise correlations of -0.3 -- 5-dimensional random variables with all pairwise correlations equal can only have correlations $-0.25\leq \rho\leq 1$ (and your approach with Cholesky decomposition will only work for $-0.25 < \rho < 1$). So ultimately you won't be able to draw a 5-dimensional sample with all pairwise correlations very close to -0.3 -- this is mathematically impossible.

Great answer. Now I understand my problem. One last question to make sure I get it. In my code I use `np.random.normal(0,1, (5,2000))` so that I get `N=5` signals with `d=2000` samples. So 5 signals each having 2000 values (elements). The correlation matrix of these 5 signals will then be `5x5` since I have 5 signals in that case. In your answer, I guess `d` refers to the number of signals right? Like here: https://www.dsprelated.com/showarticle/1241.php — seralouk, Feb 22 '20 at 12:05

Cholesky decomposition or alternative for negatively correlated data simulations

1 Answers1