Generate sets of values with high correlation coefficient

Question

Apologies if this is too simple. I couldn't get the more advanced r-help group to respond.

I am planning to characterize workloads by measuring the correlation coefficient between two sets of real values. Before that, I wish to generate sample data: one pair of sets with a high correlation coefficient and one pair with a low coefficient. I want to plot both in the same graph so that I can see the highly correlated values move together (peaks and troughs). I use R and know about rseek.

If there is any particular R book that could help my capacity planning efforts, I will buy it.

The question "Generate a random variable with a defined correlation to an existing variable" is a tad too advanced for me at this time.

Note: The two sets of values that I am about to plot are related because I am plotting CPU usage and a throughput number. So if the number of bytes increases, the CPU usage may increase. Both are positive values. So if the correlation is high, both will either increase together or decrease together.

Thanks.

What about the distribution of the values? Are they supposed to be Gaussian? — gui11aume, Jun 18 '12 at 09:53
**Quick and dirty**: take two equal-length vectors of values `x` and `y` (such as the "real values" in the question, or simulations thereof). Look at `plot(sort(x),sort(y))`, `plot(x,sample(y))`, and `plot(sort(x),-sort(-y))` to see the extreme behaviors (high correlation, almost no correlation, high *negative* correlation). — whuber, Jun 18 '12 at 21:22
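whuber's quick check can be tried on simulated data. The following is a minimal sketch, not part of the original comment: the vectors `x` and `y` are hypothetical stand-ins for throughput and CPU usage, and the seed, sample size, means and standard deviations are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 200

# Hypothetical stand-ins for two measured series (assumed values)
x <- rnorm(n, mean = 500, sd = 100)  # e.g. a throughput number
y <- rnorm(n, mean = 50, sd = 10)    # e.g. CPU usage

# Correlations for the three pairings suggested in the comment
r_high <- cor(sort(x), sort(y))    # both ascending: strong positive correlation
r_none <- cor(x, sample(y))        # random pairing: correlation near zero
r_neg  <- cor(sort(x), -sort(-y))  # ascending vs descending: strong negative

# The three scatter plots side by side
op <- par(mfrow = c(1, 3))
plot(sort(x), sort(y), main = "high correlation")
plot(x, sample(y), main = "almost no correlation")
plot(sort(x), -sort(-y), main = "high negative correlation")
par(op)
```

Sorting only changes how the values are paired, so each vector keeps its own distribution while the correlation moves between the extremes.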

ocram · Accepted Answer · 2012-06-18T15:00:00.530

You can, for example, generate data from a bivariate normal distribution. The off-diagonal entry of the variance-covariance matrix is the covariance, which you choose so as to obtain the desired correlation. In R, this can readily be done with `rmvnorm` from the `mvtnorm` package.

Example: Generate $1000$ realisations from $X=(X_{1}, X_{2})' \sim N(\mu, \Sigma)$ with $$\mu = (-1, 5)', \quad \Sigma_{11} = V(X_{1}) = 0.7, \quad \Sigma_{22}= V(X_{2}) = 0.1$$ and $\Sigma_{12} = \Sigma_{21} = \textrm{Cov}(X_1, X_2)$ such that $\textrm{Cor}(X_{1}, X_{2})=0.85$.

> #------load the package------
> library(mvtnorm)
> #----------------------------
> 
> #------compute the covariance such that cor(X1, X2) = 0.85------
> covariance <- 0.85 * sqrt(0.7) * sqrt(0.1)
> #---------------------------------------------------------------
> 
> #------variance-covariance matrix------
> sigma <- matrix(c(0.7, covariance, covariance, 0.1), nrow=2, byrow=TRUE)
> sigma
          [,1]      [,2]
[1,] 0.7000000 0.2248889
[2,] 0.2248889 0.1000000
> #--------------------------------------
> 
> #------data generation------
> test <- rmvnorm(n=1000, mean=c(-1, 5), sigma=sigma)
> #---------------------------
> 
> #------compute the empirical correlation on this particular data------
> cor(test[, 1], test[, 2])
[1] 0.8478849
> #---------------------------------------------------------------------

NB: You can also generate data according to a linear regression model: $X_2 = a + bX_1 + \epsilon$.

A reminder that I am a learner. The mention of linear regression here seems to imply that the two sets of values I am about to plot are related, which is true because I am plotting CPU usage and a throughput number. So if the number of bytes increases, the CPU usage may increase. Both are positive values. Apologies for alluding to capacity planning without mentioning this properly. Both R code samples (ocram's and user603's) produce negative values, and the values are not continuously increasing or decreasing. — Mohan Radhakrishnan, Jun 19 '12 at 06:52
You should have added this requirement to the question. The method based on a linear model, or the other methods proposed on this page, can do the job. — ocram, Jun 19 '12 at 07:24

score 5 · Answer 2 · answered Jun 18 '12 at 08:49

Others have given you code. Here is the idea behind it.

Generate $X$, and then let $Y = X+Z$, where $Z$ is independent of $X$.

If $var(Z)$ is small compared with $var(X)$ then the correlation between $X$ and $Y$ will be high. If $var(Z)$ is large compared with $var(X)$ then the correlation between $X$ and $Y$ will be low.
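The relationship can be made exact. Because $Z$ is independent of $X$, $\textrm{Cov}(X, Y) = var(X)$ and $var(Y) = var(X) + var(Z)$, so $\textrm{Cor}(X, Y) = 1 / \sqrt{1 + var(Z)/var(X)}$. The sketch below checks this on simulated data; the normal distributions, the seed, the sample size and the helper name `theoretical_cor` are assumptions made for illustration, not part of the original answer.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 10000
x <- rnorm(n, mean = 0, sd = 1)  # var(X) = 1

# Theoretical correlation between X and Y = X + Z when Z is independent of X
# (hypothetical helper, named here for illustration)
theoretical_cor <- function(var_x, var_z) 1 / sqrt(1 + var_z / var_x)

y_high <- x + rnorm(n, sd = 0.3)  # var(Z) = 0.09, small relative to var(X)
y_low  <- x + rnorm(n, sd = 3)    # var(Z) = 9, large relative to var(X)

# Empirical versus theoretical correlation for each case
high <- c(empirical = cor(x, y_high), theoretical = theoretical_cor(1, 0.3^2))
low  <- c(empirical = cor(x, y_low),  theoretical = theoretical_cor(1, 3^2))
print(round(rbind(high, low), 3))
```

With these variances the theoretical values are about 0.958 and 0.316, and the empirical correlations should land close to them for a sample of this size.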

Douglas Zare

(+1) Since it is easy to characterize exactly what the (theoretical) correlation will be, adding that small bit might improve this answer even further. – cardinal Jun 18 '12 at 13:11

user603 · Answer 3 · score 2 · 2012-06-18T10:29:43.670

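library("MASS")  # provides mvrnorm()

# 2x2 correlation matrices (unit variances): strong and weak dependence
highCor <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
lowCor  <- matrix(c(1, 0.1, 0.1, 1), 2, 2)

# draw 100 bivariate normal points with mean (0, 0) from each matrix
x_hc <- mvrnorm(100, rep(0, 2), highCor)
x_lc <- mvrnorm(100, rep(0, 2), lowCor)

# empty plot spanning both samples, then overlay the two point clouds
plot(rbind(x_hc, x_lc), type = "n")
points(x_lc, pch = 16, col = "green")  # low correlation in green
points(x_hc, pch = 16, col = "blue")   # high correlation in blue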

edited Jun 18 '12 at 10:29

answered Jun 18 '12 at 08:31

score 1 · Answer 4 · edited Jun 18 '12 at 22:16

The answers given here, as well as the checked answer to the previous post, give you a lot of valid ways to do this. My suggestion would have been the same as the NB given above by ocram. Take a linear function $Y=a+bX$ and add an error term $\epsilon \sim N(0, \sigma^2)$ with a small standard deviation $\sigma$ (small relative to the spread of $bX$). This will generate a pair of random variables with a high correlation. To generate a pair of variables with low correlation, just take a large value for $\sigma$.
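The linear-model approach can be sketched in R as follows. This is only an illustration under assumed values: the uniform distribution for $X$, the coefficients `a` and `b`, the two values of `sigma`, and the helper names `simulate_y` and `standardize` are all hypothetical choices. The series are standardized before plotting so that all three fit on one axis and their peaks and troughs can be compared.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
n <- 100
x <- runif(n, min = 0, max = 10)  # assumed distribution for X
a <- 2                            # assumed intercept
b <- 3                            # assumed slope

# Y = a + b * X + error, with error ~ N(0, sigma^2)
simulate_y <- function(x, a, b, sigma) a + b * x + rnorm(length(x), sd = sigma)

y_high <- simulate_y(x, a, b, sigma = 1)   # small sigma: high correlation
y_low  <- simulate_y(x, a, b, sigma = 40)  # large sigma: low correlation

r_high <- cor(x, y_high)
r_low  <- cor(x, y_low)
print(c(high = r_high, low = r_low))

# Rescale each series to mean 0 and sd 1 so they share one axis
standardize <- function(v) (v - mean(v)) / sd(v)

plot(standardize(x), type = "l", ylim = c(-4, 4),
     xlab = "observation", ylab = "standardized value")
lines(standardize(y_high), col = "blue")   # follows the peaks and troughs of x
lines(standardize(y_low), col = "green")   # mostly noise relative to x
legend("topright", legend = c("x", "y, small sigma", "y, large sigma"),
       col = c("black", "blue", "green"), lty = 1)
```

Note that with a large `sigma` the simulated values can become negative, so for strictly positive measurements such as CPU usage a different error model may be needed.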
