60

I'm sorry if this seems a bit too basic, but I guess I'm just looking to confirm understanding here. I get the sense I'd have to do this in two steps, and I've started trying to grok correlation matrices, but it's just starting to seem really involved. I'm looking for a concise explanation (ideally with hints towards a pseudocode solution) of a good, ideally quick way to generate correlated random numbers.

Given two pseudorandom variables height and weight with known means and variances, and a given correlation, I think I'm basically trying to understand what this second step should look like:

    height = gaussianPdf(height.mean, height.variance)
    weight = gaussianPdf(correlated_mean(height.mean, correlation_coefficient),
                         correlated_variance(height.variance, correlation_coefficient))
  • How do I calculate the correlated mean and variance? (I also want to confirm that this is really the relevant problem here.)
  • Do I need to resort to matrix manipulation? Or do I have something else very wrong in my basic approach to this problem?
Jeromy Anglim
Joseph Weissman
  • Not sure I understand you correctly, but you don't have to calculate "correlated mean and variance". If you're assuming that the variables are bivariate normal, it should be enough to specify the individual means and variances and the correlation. Is there any particular software you want to use for this? – mark999 Oct 07 '12 at 20:29
  • The following Qs are strongly related & will be of interest: [How to define a distribution such that draws from it correlate with a draw from another pre-specified distribution?](http://stats.stackexchange.com/questions/13382/) & [Generate a random variable with a defined correlation to an existing variable](http://stats.stackexchange.com/questions/15011/). – gung - Reinstate Monica Oct 16 '12 at 16:11
  • Also: [How can I generate data with a prespecified correlation matrix?](http://stats.stackexchange.com/questions/32169/) – gung - Reinstate Monica Sep 10 '13 at 20:10

4 Answers

52

To answer your question on "a good, ideally quick way to generate correlated random numbers": given a desired variance-covariance matrix $C$ (which, as a valid covariance matrix, is positive definite), its Cholesky decomposition is $C = LL^T$, where $L$ is a lower triangular matrix.

If you now use this matrix $L$ to project a vector $X$ of uncorrelated random variables, the resulting projection $Y = LX$ is a vector of correlated random variables.
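In brief, this works because, if the components of $X$ are zero-mean, unit-variance, and uncorrelated, then $E[XX^T]=I$ and $$\text{Cov}[Y] = E[(LX)(LX)^T] = L\,E[XX^T]\,L^T = LL^T = C.$$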

You can find a concise explanation of why this happens here.
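For concreteness, here is a minimal R sketch of this recipe for two standard normal variables (the target correlation of 0.7 and the sample size are my arbitrary choices; note that R's chol() returns the upper triangular factor, hence the transpose):

    C = matrix(c(1, 0.7,
                 0.7, 1), nrow=2)       # desired covariance matrix
    L = t(chol(C))                      # lower triangular factor, C = L %*% t(L)
    X = matrix(rnorm(2*10000), nrow=2)  # rows are uncorrelated N(0,1) variables
    Y = L %*% X                         # rows of Y now have covariance C
    cor(Y[1,], Y[2,])                   # close to 0.7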

usεr11852
  • Does this method apply only for Gaussian distributions (as specified in the question), or can it be used for generating correlated variables that follow other distributions? If not, are you aware of a method that could be used in that case? – user000001 Mar 03 '13 at 15:03
  • Does it also work if I take any $L$ *not* necessarily triangular but satisfying $C=LL^T$? – Michael Apr 01 '16 at 17:51
  • @Michael: Yes. Having said that, given that $C$ is a valid covariance matrix, the Cholesky decomposition is the fastest way. You could also get the (symmetric) square-root matrix $X$ of $C$ by using the SVD (so $C = XX = XX^T$, where $X = U S^{0.5} V^T$ from $C = USV^T$), but that would be more expensive. – usεr11852 Apr 01 '16 at 18:29
  • Thanks. Also, when I use two different matrices satisfying the condition, say $L_1$ is the Cholesky decomposition and $L_2$ is the square root of $C$ (which exists since $C\geq 0$, and satisfies $L_2 = L_2^T$, so $L_2L_2 = L_2L_2^T = C$), I get $Y_1 = L_1X$ and $Y_2=L_2X$, which are different. Is that how it is supposed to be? – Michael Apr 03 '16 at 18:35
  • @Michael: Of course. Their covariance will be (approximately) the same, not the numbers themselves. – usεr11852 Apr 03 '16 at 19:39
  • I don't see an answer to user000001's question, so in case it has been deleted, let me check with you again: does this method apply for any distribution, or does it only apply for the normal distribution? Thank you. – Sid Jun 24 '17 at 00:11
  • @Sid: Any continuous distribution not supported on the whole real line will immediately fail. For example, if we use a uniform $U[0,1]$ we cannot guarantee that the "correlated numbers" will be in $[0,1]$; similarly, for a Poisson we will end up with non-discrete numbers. In addition, any family that is not closed under summation (e.g. summing $t$-distributed variables does not result in $t$-distributed variables) will also fail. In all the cases mentioned, the numbers produced *will be correlated* according to $C$, but they will not follow the distribution we started with. – usεr11852 Jun 24 '17 at 12:55
  • @usεr11852 Thanks a lot. I am interested in the beta distribution, which fails to meet two of your criteria, since it is not supported on the whole real line and does not have the sum property that you described. But does this mean that there exists no method at all that works for those types of distributions, or only that this method in particular does not work? – Sid Jun 24 '17 at 13:14
  • @Sid, methods to generate correlated beta variables exist; please see the comments and answer in the threads [here](https://stats.stackexchange.com/questions/285142) and [here](https://stats.stackexchange.com/questions/284996) by whuber for more information. – usεr11852 Jun 24 '17 at 19:00
  • Thank you very much. Unfortunately, the former link only seems to cover cases with two variables and the latter is for the binomial distribution. But it might be worth it to post a separate question. – Sid Jun 24 '17 at 20:23
39

+1 to @user11852 and @jem77bfp; these are good answers. Let me approach this from a different perspective, not because I think it's necessarily better in practice, but because I think it's instructive. Here are a few relevant facts that we know already:

  1. $r$ is the slope of the regression line when both $X$ and $Y$ are standardized, i.e., $\mathcal N(0,1)$,
  2. $r^2$ is the proportion of the variance in $Y$ attributable to the variance in $X$,

    (also, from the rules for variances):

  3. the variance of a random variable multiplied by a constant is the constant squared times the original variance:
    $$\text{Var}[aX]=a^2\text{Var}[X]$$
  4. variances add, i.e., the variance of the sum of two random variables (assuming they are independent) is the sum of the two variances:
    $$\text{Var}[X+\varepsilon]=\text{Var}[X]+\text{Var}[\varepsilon]$$

Now, we can combine these four facts to create two standard normal variables whose populations will have a given correlation, $r$ (more properly, $\rho$), although the samples you generate will have sample correlations that vary. The idea is to create a pseudorandom variable, $X$, that is standard normal, $\mathcal N(0,1)$, and then find a coefficient, $a$, and an error variance, $v_e$, such that $Y \sim\mathcal N(0,a^2+v_e)$, where $a^2+v_e=1$. (Note that $|a|$ must be $\le 1$ for this to work, and that, moreover, $a=r$.)

Thus, you start with the $r$ that you want; that's your coefficient, $a$. Then you figure out the error variance you will need: it's $1-r^2$. (If your software requires you to use the standard deviation, take the square root of that value.) Finally, for each pseudorandom variate, $x_i$, that you have generated, generate a pseudorandom error variate, $e_i$, with the appropriate error variance, $v_e$, and compute the correlated pseudorandom variate, $y_i$, by multiplying and adding: $y_i = r x_i + e_i$.

If you wanted to do this in R, the following code might work for you:

correlatedValue = function(x, r){
  r2 = r^2
  ve = 1 - r2                           # error variance, so that var(y) = 1
  SD = sqrt(ve)                         # rnorm() is parameterized by the SD
  e  = rnorm(length(x), mean=0, sd=SD)  # independent errors
  y  = r*x + e                          # slope r plus error gives cor(x,y) ~ r
  return(y)
}

set.seed(5)
x = rnorm(10000)
y = correlatedValue(x=x, r=.5)

cor(x,y)
[1] 0.4945964

(Edit: I forgot to mention:) As I've described it, this procedure gives you two standard normal correlated variables. If you don't want standard normals, but want the variables to have some specific means (not 0) and SDs (not 1), you can transform them without affecting the correlation. Thus, you would subtract the observed mean to ensure that the mean is exactly $0$, multiply the variable by the SD you want, and then add the mean you want. (If you want the observed mean to fluctuate around the desired mean instead, add the initial difference back; that undoes the centering.) Essentially, this is a z-score transformation in reverse. Because it is a linear transformation, the transformed variable will have the same correlation with the other variable as before.
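For example, a minimal sketch of that rescaling, continuing the code above (the target mean of 50 and SD of 10 are arbitrary choices of mine; dividing by the observed SD makes the SD exact as well, in the spirit of the reverse z-score):

    y.scaled = 50 + 10*(y - mean(y))/sd(y)  # reverse z-score: center, rescale, shift
    cor(x, y.scaled)                        # identical to cor(x, y)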

Again, this, in its simplest form, only lets you generate a pair of correlated variables (it can be scaled up, but gets ugly fast), and it is certainly not the most convenient way to get the job done. In R, you would want to use ?mvrnorm in the MASS package, both because it's easier and because you can generate many variables with a given population correlation matrix. Nonetheless, I think it's worthwhile to have walked through this process to see how some basic principles play out in a simple way.
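For reference, a minimal mvrnorm call might look like this (the correlation of .5 is again an arbitrary choice):

    library(MASS)                      # provides mvrnorm()
    Sigma = matrix(c(1, .5,
                     .5, 1), nrow=2)   # population correlation matrix
    d = mvrnorm(n=10000, mu=c(0,0), Sigma=Sigma)
    cor(d[,1], d[,2])                  # close to .5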

gung - Reinstate Monica
  • This essentially regressional approach is particularly nice in that it lets one generate one random Y correlated with any number of _existing_ X "predictors". Am I right in that understanding? – ttnphns Aug 04 '19 at 12:09
  • It depends on exactly what pattern of correlations amongst the variables you want, @ttnphns. You can iterate this through one after another, but it would get tedious. To create many correlated variables with a given pattern, it's better to use the Cholesky decomposition. – gung - Reinstate Monica Aug 04 '19 at 12:29
  • gung, do you know how to use Cholesky to generate one Y correlated (approximately, as in your method) according to a vector of correlations with several _existing_ (not simulated) Xs? – ttnphns Aug 04 '19 at 12:35
  • @ttnphns, you want to generate a single Y w/ a given population correlation w/ a set of X's, not a set of p variables that all have prespecified population correlations? A simple way would be to write a regression equation to generate a single Y-hat from your X's, then use the method above to generate Y as a correlate of your Y-hat. You could ask a new question about it, if you want. – gung - Reinstate Monica Aug 04 '19 at 12:41
  • This is what I meant in my initial comment: this method would be a direct extension of what you speak of in your answer, essentially a regressional (hat) method. – ttnphns Aug 04 '19 at 12:47
16

In general this is not a simple thing to do, but I believe there are packages for multivariate normal variable generation (at least in R; see mvrnorm in the MASS package), where you just input a covariance matrix and a mean vector.

There is also one more "constructive" approach. Let's say we want to model a random vector $(X_1,X_2)$ and we have its distribution function $F(x_1,x_2)$. The first step is to get the marginal distribution function of $X_1$; i.e. let $x_2$ tend to infinity: $$F_{X_1}(x_1)= \lim_{x_2\to\infty} F(x_1,x_2). $$ Then we find $F^{-1}_{X_1}$, the inverse function of $F_{X_1}$, and plug in a random variable $\xi_1$ which is uniformly distributed on the interval $[0,1]$. In this step we generate the first coordinate $\hat{x}_1=F^{-1}_{X_1}(\xi_1)$.

Now, since we have got one coordinate, we need to condition the initial distribution function $F(x_1,x_2)$ on $X_1=\hat{x}_1$. The resulting conditional distribution function is $$F(x_2 \mid X_1=\hat{x}_1)= \frac{\partial F(x_1,x_2)/\partial x_1 \,\big|_{x_1=\hat{x}_1}}{f_{X_1}(\hat{x}_1)}, $$ where $f_{X_1}$ is the probability density function of the marginal $X_1$ distribution; i.e. $F'_{X_1}(x_1)=f_{X_1}(x_1)$.

Then again you generate a uniformly distributed variable $\xi_2$ on $[0,1]$ (independent of $\xi_1$) and plug it into the inverse of $F(x_2 \mid X_1=\hat{x}_1)$. You thereby obtain $\hat{x}_2$ satisfying $F(\hat x_2 \mid X_1=\hat{x}_1) = \xi_2$. This method can be generalized to vectors with more dimensions, but its downside is that you have to calculate, analytically or numerically, many functions. The idea can also be found in this article: http://www.econ-pol.unisi.it/dmq/pdf/DMQ_WP_34.pdf.
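As an illustration, here is a minimal R sketch of this two-step recipe for the standard bivariate normal case, where both inverses are available in closed form: the marginal of $X_1$ is $\mathcal N(0,1)$, and the conditional of $X_2$ given $X_1=\hat x_1$ is $\mathcal N(\rho\hat x_1,\, 1-\rho^2)$ (the correlation $\rho = 0.6$ is an arbitrary choice):

    rho = 0.6
    xi1 = runif(10000); xi2 = runif(10000)          # independent U[0,1] draws
    x1 = qnorm(xi1)                                 # invert the marginal CDF
    x2 = qnorm(xi2, mean=rho*x1, sd=sqrt(1-rho^2))  # invert the conditional CDF
    cor(x1, x2)                                     # close to rho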

If you don't understand the meaning of plugging a uniform variable into an inverse cumulative distribution function, try to make a sketch of the univariate case and then remember what the geometric interpretation of the inverse function is.

whuber
jem77bfp
  • Smart idea! Has simple intuitive appeal. But yes seems expensive computationally. – MichaelChirico Sep 28 '16 at 16:27
  • (+1) Very good point. It would be better to start by saying $f_{X,Y}(x,y)=f_X(x)\cdot f_{Y|X}(y)$; then it flows more naturally to first generate one marginal distribution and then the conditional distribution. Excellent! – KevinKim Jan 18 '17 at 22:04
1

If you are ready to give up efficiency, you can use a throw-away algorithm. Its advantage is that it allows for any kind of distribution (not only Gaussian).

Start by generating two uncorrelated sequences of random numbers $\{x_i\}_{i=1}^N$ and $\{y_i\}_{i=1}^N$ with any desired distributions. Let $C$ be the desired value of the correlation coefficient. Then do the following (an R sketch of the loop appears after the steps):

1) Compute the correlation coefficient $c_{old}=corr(\{x_i\},\{y_i\})$.

2) Generate two random indices $n_1$ and $n_2$ with $1 \leq n_1, n_2 \leq N$.

3) Swap the numbers $x_{n_1}$ and $x_{n_2}$.

4) Compute the new correlation $c_{new}=corr(\{x_i\},\{y_i\})$.

5) If $|C-c_{new}| < |C-c_{old}|$, keep the swap; else undo it.

6) If the current correlation is within $\epsilon$ of $C$, stop; else go to step 1.

Random swaps will not alter the marginal distribution of $\{x_i\}$.
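A minimal R sketch of this loop (the uniform marginals, target $C = 0.5$, and tolerance $\epsilon = 0.001$ are arbitrary choices of mine; as with the algorithm itself, convergence is not formally guaranteed):

    x = runif(1000); y = runif(1000)  # any marginal distributions will do
    C = 0.5; eps = 0.001              # target correlation and tolerance
    c.old = cor(x, y)
    while (abs(C - c.old) >= eps) {
      n = sample(length(x), 2)        # steps 2-3: pick two positions and swap
      x[n] = x[rev(n)]
      c.new = cor(x, y)               # step 4: recompute the correlation
      if (abs(C - c.new) < abs(C - c.old)) c.old = c.new  # step 5: keep the swap
      else x[n] = x[rev(n)]           # ...otherwise undo it
    }
    cor(x, y)                         # step 6: within eps of C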

Good luck!

Cliff AB
F. Jatpil
  • I'm slightly confused by notation. Is $x_i$ a vector? If not, what does $corr(x_i, y_i)$ mean? – Cliff AB Jun 05 '19 at 20:44
  • I am sorry, I am a hobbyist in statistics; I am not familiar with the notation. $x_i$ is a number, $\{x_i\}$ is a sequence of numbers (characterized by mean, variance, probability distribution), and so is $y$. $corr(x_i,y_i)$ is not well written; it should be $corr(\{x_i\},\{y_i\}) = (1/N) \Sigma_{i=1}^{N}(x_i- \bar x)(y_i - \bar y)$ – F. Jatpil Jun 05 '19 at 21:32
  • I see, makes perfect sense. I ignored the "$\{ \}$" in $corr(\{x_i\}, \{y_i\})$ – Cliff AB Jun 05 '19 at 21:53