8

Apologies if this is too simple. I couldn't get the more advanced r-help group to respond.

I plan to characterize workloads by measuring the correlation coefficient between two sets of real values. Before that, I want to generate sample data: one pair of sets with a high correlation coefficient and one pair with a low coefficient. I want to plot both sets in the same graph so that I can see the highly correlated values move together (peaks and troughs). I use R and know about rseek.

If there is a particular R book that could help my capacity planning efforts, I will buy it.

The question "Generate a random variable with a defined correlation to an existing variable" is a tad too advanced for me at this time.

Note: The two sets of values that I am about to plot are related because I am plotting CPU usage and a throughput number. If the number of bytes increases, the CPU usage may increase. Both are positive values. So if the correlation is high, both will either increase together or decrease together.

Thanks.

  • 2
    What about the distribution of the values? Are they supposed to be Gaussian? – gui11aume Jun 18 '12 at 09:53
  • **Quick and dirty**: take two equal-length vectors of values `x` and `y` (such as the "real values" in the question, or simulations thereof). Look at `plot(sort(x),sort(y))`, `plot(x,sample(y))`, and `plot(sort(x),-sort(-y))` to see the extreme behaviors (high correlation, almost no correlation, high *negative* correlation). – whuber Jun 18 '12 at 21:22

4 Answers

6

You can, for example, generate data from a bivariate normal distribution. The off-diagonal entry of the variance-covariance matrix is the covariance. In R, this can readily be done with `rmvnorm` from the `mvtnorm` package.

**Example.** Generate $1000$ realisations from $X=(X_{1}, X_{2})' \sim N(\mu, \Sigma)$ with $$\mu = (-1, 5)', \quad \Sigma_{11} = V(X_{1}) = 0.7, \quad \Sigma_{22}= V(X_{2}) = 0.1$$ and $\Sigma_{12} = \Sigma_{21} = \textrm{Cov}(X_1, X_2)$ such that $\textrm{Cor}(X_{1}, X_{2})=0.85$.

> #------load the package------
> library(mvtnorm)
> #----------------------------
> 
> #------compute the covariance such that cor(X1, X2) = 0.85------
> covariance <- 0.85 * sqrt(0.7) * sqrt(0.1)
> #---------------------------------------------------------------
> 
> #------variance-covariance matrix------
> sigma <- matrix(c(0.7, covariance, covariance, 0.1), nrow=2, byrow=TRUE)
> sigma
          [,1]      [,2]
[1,] 0.7000000 0.2248889
[2,] 0.2248889 0.1000000
> #--------------------------------------
> 
> #------data generation------
> test <- rmvnorm(n=1000, mean=c(-1, 5), sigma=sigma)
> #---------------------------
> 
> #------compute the empirical correlation on this particular data------
> cor(test[, 1], test[, 2])
[1] 0.8478849
> #---------------------------------------------------------------------


NB: You can also generate data according to a linear regression model: $X_2 = a + bX_1 + \epsilon$.
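As a rough illustration of the linear-model route mentioned in the NB, the sketch below draws $X_1$ with the same mean and variance as in the example and builds $X_2 = a + bX_1 + \epsilon$. The values of `a`, `b` and `noise_sd` are illustrative assumptions, not part of the original answer. Under this model the implied correlation is $b\,\mathrm{sd}(X_1) / \sqrt{b^2 V(X_1) + \sigma_\epsilon^2}$, which the code compares with the sample value.

```r
# Sketch: generate X2 from X1 through a linear model with Gaussian noise.
set.seed(1)                                  # arbitrary seed, for reproducibility only
n  <- 1000
x1 <- rnorm(n, mean = -1, sd = sqrt(0.7))    # same mean and variance as X1 in the example

a        <- 5      # intercept (illustrative assumption)
b        <- 0.3    # slope (illustrative assumption)
noise_sd <- 0.1    # standard deviation of epsilon (illustrative assumption)

x2 <- a + b * x1 + rnorm(n, mean = 0, sd = noise_sd)

# Correlation implied by the model:
# Cor(X1, X2) = b * sd(X1) / sqrt(b^2 * Var(X1) + noise_sd^2)
rho_theory <- b * sqrt(0.7) / sqrt(b^2 * 0.7 + noise_sd^2)
rho_sample <- cor(x1, x2)

print(c(theory = rho_theory, sample = rho_sample))
```

Increasing `noise_sd` (or shrinking `b`) lowers the correlation; decreasing it pushes the correlation towards 1.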

ocram
  • A reminder that I am a learner. The mention of linear regression here seems to mean that the two sets of values that I am about to plot are related, which is true because I am plotting CPU usage and a throughput number. If the number of bytes increases, the CPU usage may increase. Both are positive values. Apologies for alluding to capacity planning and not mentioning this properly. Both R code samples (ocram and user603) have negative values, and the values are not continuously increasing or decreasing. – Mohan Radhakrishnan Jun 19 '12 at 06:52
  • You should have added this requirement to the question. The method based on a linear model, or the other methods proposed on this page, can do the job. – ocram Jun 19 '12 at 07:24
5

Others have given you code. Here is the idea behind it.

Generate $X$, and then let $Y = X+Z$, where $Z$ is independent of $X$.

If $var(Z)$ is small compared with $var(X)$ then the correlation between $X$ and $Y$ will be high. If $var(Z)$ is large compared with $var(X)$ then the correlation between $X$ and $Y$ will be low.
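To make this quantitative: since $Z$ is independent of $X$, $cov(X, Y) = var(X)$ and $var(Y) = var(X) + var(Z)$, so the correlation is $1/\sqrt{1 + var(Z)/var(X)}$. The sketch below checks this numerically; the helper name `cor_with_noise`, the normal distributions and the chosen standard deviations are assumptions made only for illustration.

```r
# Sketch: correlation between X and Y = X + Z for several noise levels.
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 10000
x <- rnorm(n, sd = 1)             # Var(X) = 1 (illustrative choice)

# Hypothetical helper: draws Z independently of X and returns cor(X, X + Z).
cor_with_noise <- function(x, z_sd) {
  z <- rnorm(length(x), sd = z_sd)
  cor(x, x + z)
}

z_sds      <- c(0.1, 1, 10)       # small, comparable and large noise
sample_cor <- sapply(z_sds, function(s) cor_with_noise(x, s))
theory_cor <- 1 / sqrt(1 + z_sds^2 / 1)   # 1 / sqrt(1 + Var(Z) / Var(X))

print(data.frame(z_sd = z_sds, theory = theory_cor, sample = sample_cor))
```

The sample correlations should track the theoretical ones closely and fall as the noise standard deviation grows.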

Douglas Zare
  • 3
    (+1) Since it is easy to characterize exactly what the (theoretical) correlation will be, adding that small bit might improve this answer even further. – cardinal Jun 18 '12 at 13:11
2
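library("MASS")  # provides mvrnorm
# 2x2 correlation matrices (unit variances); the off-diagonal entry is the correlation
highCor <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
lowCor <- matrix(c(1, 0.1, 0.1, 1), 2, 2)
# 100 bivariate normal draws with zero means for each case
x_hc <- mvrnorm(100, rep(0, 2), highCor)
x_lc <- mvrnorm(100, rep(0, 2), lowCor)
# empty plot spanning both samples, then overlay the points
plot(rbind(x_hc, x_lc), type = "n")
points(x_lc, pch = 16, col = "green")  # low correlation in green
points(x_hc, pch = 16, col = "blue")   # high correlation in blue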
user603
1

The answers given here, as well as the checked answer to the previous post, give you a lot of valid ways to do this. My suggestion would have been the same as the NB given above by ocram. Take a linear function $Y = a + bX$ and add an error term $\epsilon \sim N(0, \sigma^2)$ with a small value for the standard deviation $\sigma$ (small relative to $|b|$ times the standard deviation of $X$). This will generate a pair of random variables with a high correlation. To generate a pair of variables with low correlation, just take a large value for $\sigma^2$.
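A minimal sketch of this recipe for the situation in the question (two positive series, throughput and CPU usage, viewed on the same graph) might look as follows. The uniform throughput, the coefficients `a` and `b`, the two `sigma` values, the clipping of CPU usage to the range 0 to 100 and the helper name `make_cpu` are all assumptions made for illustration.

```r
# Sketch: non-negative throughput/CPU series with high and lower correlation.
set.seed(7)                                   # arbitrary seed, for reproducibility only
n <- 100
throughput <- runif(n, min = 10, max = 100)   # hypothetical throughput, always positive

# Hypothetical helper: CPU = a + b * throughput + N(0, sigma), kept within 0..100.
make_cpu <- function(throughput, sigma, a = 20, b = 0.5) {
  cpu <- a + b * throughput + rnorm(length(throughput), mean = 0, sd = sigma)
  pmin(pmax(cpu, 0), 100)                     # clip to a valid percentage range
}

cpu_high <- make_cpu(throughput, sigma = 2)   # small sigma: high correlation
cpu_low  <- make_cpu(throughput, sigma = 40)  # large sigma: lower correlation

cor_high <- cor(throughput, cpu_high)
cor_low  <- cor(throughput, cpu_low)
print(c(high = cor_high, low = cor_low))

# Draw each pair on a common (standardised) scale so peaks and troughs line up.
if (interactive()) {
  op <- par(mfrow = c(2, 1))
  matplot(cbind(scale(throughput), scale(cpu_high)), type = "l", lty = 1,
          col = c("blue", "red"), xlab = "sample index", ylab = "standardised value",
          main = sprintf("small sigma, cor = %.2f", cor_high))
  matplot(cbind(scale(throughput), scale(cpu_low)), type = "l", lty = 1,
          col = c("blue", "red"), xlab = "sample index", ylab = "standardised value",
          main = sprintf("large sigma, cor = %.2f", cor_low))
  par(op)
}
```

Note that clipping changes the correlation slightly compared with the unclipped linear model, so the printed values are best read as empirical rather than exact.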

Macro
Michael R. Chernick