Creating multiple categorical variables with a specified association (correlation) matrix

Question

Let's say I want to generate data with a particular association matrix. I am taking the phi coefficient as the measure of the degree of association.

Here are examples using R.

    require(psych)
    var1 <- sample(c("P", "A"), 10000, replace = TRUE)
    var2 <- sample(c("P", "A"), 10000, replace = TRUE)
    mydf <- data.frame(var1, var2)

    # No association case: the variables are sampled independently,
    # so a phi close to 0 is expected
    phi(table(var1, var2))
    # [1] -0.01

    # Copy of the same variable, so a phi of 1 is expected
    var3 <- var1
    phi(table(var1, var3))
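
For reference, here is a minimal base R sketch of what the phi coefficient measures for a 2 x 2 table, without the psych package. The names `phi_manual` and `phi_cor` and the seed are only illustrative. It computes phi from the cell counts and checks that it matches the Pearson correlation of the 0/1-coded variables, which is why a phi matrix can be treated much like a correlation matrix in the dichotomous case.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 10000
var1 <- sample(c("P", "A"), n, replace = TRUE)
var2 <- sample(c("P", "A"), n, replace = TRUE)

# 2 x 2 contingency table; rows and columns are ordered "A", "P"
tab <- table(var1, var2)
n_aa <- as.numeric(tab[1, 1]); n_ap <- as.numeric(tab[1, 2])
n_pa <- as.numeric(tab[2, 1]); n_pp <- as.numeric(tab[2, 2])

# phi = (n_AA * n_PP - n_AP * n_PA) / sqrt(product of the four margins)
phi_manual <- (n_aa * n_pp - n_ap * n_pa) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# Pearson correlation of the same variables coded as 0/1
phi_cor <- cor(as.numeric(var1 == "P"), as.numeric(var2 == "P"))

print(c(phi_manual = phi_manual, phi_cor = phi_cor))
```

Both numbers should agree up to floating-point error and be close to 0 here, since the two variables are sampled independently.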

Assume that I have a 4 x 4 matrix of phi coefficients between four categorical variables. Say the following is the association matrix (analogous to a correlation matrix):

    amat <- matrix(c(1, 0.5, 0.4, 0.3,  0.5, 1, 0.5, 0.3,  0.4, 0.5, 1, 0.2,  0.3, 0.3, 0.2, 1), 4)
    rownames(amat) <- c("VarA", "VarB", "VarC", "VarD")
    colnames(amat) <- c("VarA", "VarB", "VarC", "VarD")
    amat
         VarA VarB VarC VarD
    VarA  1.0  0.5  0.4  0.3
    VarB  0.5  1.0  0.5  0.3
    VarC  0.4  0.5  1.0  0.2
    VarD  0.3  0.3  0.2  1.0

Is there any way to generate data with four variables and, say, 10000 observations that approximately reproduce the above association matrix?

I know from the post how we can do a similar thing with quantitative variables. The examples do not need to be R specific; I only want to know the idea, which can be translated into any programming language.

the variables are either dichotomous or polytomous with variable levels — Ram Sharma, Jun 27 '14 at 12:17
Good question. I found a paper that suggests two multivariate normal approximations. They mention in the introduction that the exact approach would be to simulate using conditional probabilities, but they point out that this is sensitive to the ordering of your variables and is computationally expensive. Source: http://link.springer.com/article/10.1007%2Fs10928-006-9033-1 — shadowtalker, Jun 30 '14 at 18:09
@ssdecontrol thanks, that is really interesting. I would like to find an implementation as well. — Ram Sharma, Jul 03 '14 at 15:21

John · Answer 1 · 2014-07-03T14:27:04.183

2

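If the variables are all dichotomous, you can treat them as binary (Bernoulli) variables, which makes the job easier. The `bindata` package can simulate multivariate binary data with a specified correlation structure. Here is a small example from the manual, where the diagonal of `commonprob` holds the marginal probabilities and the off-diagonal entries hold the pairwise joint probabilities: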

    require(bindata)
    require(psych)

    # matrix of marginal (diagonal) and pairwise joint (off-diagonal) probabilities
    amat <- cbind(c(1/2, 1/5, 1/6), c(1/5, 1/2, 1/6), c(1/6, 1/6, 1/2))

    out <- rmvbin(n = 100, commonprob = amat)  # n is the number of samples

    # you can replace 0 and 1 with text labels
    out[out == 1] <- "A"
    out[out == 0] <- "P"

    phi(table(out[, 1], out[, 2]))
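
To connect this to a phi matrix such as the one in the question, here is a hedged sketch of how the phi coefficients could be converted into the joint probabilities that `commonprob` expects. It assumes marginal probabilities of 0.5 for every variable, which the question does not specify, and the name `p` is only illustrative. The conversion uses the identity P(Xi = 1, Xj = 1) = pi * pj + phi_ij * sqrt(pi * (1 - pi) * pj * (1 - pj)). The call to `rmvbin` runs only if `bindata` is installed.

```r
# phi matrix from the question
amat <- matrix(c(1,   0.5, 0.4, 0.3,
                 0.5, 1,   0.5, 0.3,
                 0.4, 0.5, 1,   0.2,
                 0.3, 0.3, 0.2, 1), 4)

p <- rep(0.5, 4)          # ASSUMED marginal probabilities, not given in the question
s <- sqrt(p * (1 - p))    # standard deviations of the 0/1 variables

# joint probabilities: p_i * p_j + phi_ij * s_i * s_j (the diagonal reduces to p_i)
commonprob <- outer(p, p) + amat * outer(s, s)
print(commonprob)

# Optional: simulate with bindata if it is available
if (requireNamespace("bindata", quietly = TRUE)) {
  out <- bindata::rmvbin(10000, commonprob = commonprob)
  print(round(cor(out), 2))  # for 0/1 data, Pearson correlation equals phi
}
```

With other marginal probabilities, not every phi matrix is attainable, so the resulting joint probabilities should be checked for consistency before simulating.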

The underlying principle and method are discussed in detail in this paper [link to pdf ].
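
Since the question asks for the idea rather than a package, here is a rough base R sketch of one common approach, which as far as I understand is also the idea behind `bindata`: draw from a latent multivariate normal and dichotomize it. This sketch assumes every variable is split 50/50, so each latent variable is cut at 0. Under that assumption the phi coefficient and the latent correlation rho are related by phi = (2 / pi) * asin(rho), so the latent correlation is rho = sin(pi * phi / 2). The names `latent`, `z`, `x` and `dat` are illustrative only.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# target phi matrix from the question
amat <- matrix(c(1,   0.5, 0.4, 0.3,
                 0.5, 1,   0.5, 0.3,
                 0.4, 0.5, 1,   0.2,
                 0.3, 0.3, 0.2, 1), 4)

# latent normal correlations that give these phi values after a cut at 0
# (valid only for the ASSUMED 50/50 margins)
latent <- sin(pi * amat / 2)

n <- 10000
# correlated standard normals: independent draws times the Cholesky factor
z <- matrix(rnorm(n * 4), n, 4) %*% chol(latent)

# dichotomize at 0 to get binary variables
x <- (z > 0) * 1

# empirical phi matrix (Pearson correlation of 0/1 data equals phi)
phi_hat <- cor(x)
print(round(phi_hat, 2))

# optional: relabel as categorical text, as in the question
dat <- as.data.frame(ifelse(x == 1, "P", "A"))
names(dat) <- c("VarA", "VarB", "VarC", "VarD")
```

`chol` fails if the implied latent matrix is not positive definite, which is a useful warning that the requested phi matrix cannot be produced this way. For unequal margins the cut points become `qnorm` of the marginal probabilities and the latent correlations have to be found numerically.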

For simulating correlated ordinal data, there is another package called orddata; the underlying method is discussed in detail in this paper.

I know you might want more, but this is what I have, given that there is no other answer here so far.

edited Jul 03 '14 at 14:27

answered Jul 02 '14 at 20:23


Thank you--but how does it work? The OP is asking for the *idea* rather than some package. – whuber Jul 02 '14 at 20:27
@whuber a link on how the `bindata` package works is provided. I know the OP wants more; this is what I got. – John Jul 03 '14 at 14:29
@John thank you, this will help in some way. As the bounty is about to expire, I will award it to you for the effort, but I am still looking for a complete answer. – Ram Sharma Jul 03 '14 at 15:20
