Generate random correlated categorical (ordinal) variables

Question

Let's say I want to generate 100 observations of 2 Likert-scaled, normally(?) distributed variables with 10 categories (1-10) and a Pearson correlation of, e.g., ~0.8. I am aware that using the Pearson correlation on categorical data is controversial, but I really need it that way, because my goal is to analyze how Pearson's r behaves if I change the distances between the Likert items. (Any literature, if it exists, would also be appreciated.)

So it would look something like this:

repeat {
  x <- sample(1:10, 100, replace = TRUE, prob = c(?))
  y <- sample(1:10, 100, replace = TRUE, prob = c(?))
  if (cor(x, y) > 0.8) break  # stop at the first sample above the target
}

Of course this is super inefficient, but you get the idea of what I want. I am also not sure about the distribution. Is there a common distribution or set of probabilities that typically occurs with Likert scales like:

strongly agree | undecided | disagree | strongly disagree

or should I just use a uniform distribution?

Also check this thread: http://stats.stackexchange.com/questions/22968/simulating-responses-to-a-test-for-item-response-theory – Tim, Dec 12 '14 at 20:29
Likert data can never be normally distributed. You can assume that the underlying latent variable is normally distributed, but you will need to specify the marginal category probabilities (i.e., what proportion of the latent variable is categorized as strongly agree, etc.). Also, categorization reduces the correlation, so you either want latent variables that are more highly correlated or Likert variables that are less correlated. – gung - Reinstate Monica, Dec 14 '14 at 15:22
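The latent-variable idea from the comment above can be sketched in R as follows. This is only a minimal illustration, not the commenter's own code: the latent correlation of 0.85, the uniform category shares, and the seed are all assumptions chosen for the example.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n   <- 100    # observations, as in the question
k   <- 10     # Likert categories
rho <- 0.85   # assumed latent correlation, set a bit above the 0.8 target

# Two standard normal latent variables with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Assumed marginal category probabilities (uniform here)
prob <- rep(1 / k, k)

# Thresholds on the latent scale that reproduce these marginals
breaks <- c(-Inf, qnorm(cumsum(prob)[-k]), Inf)

# Categorize: findInterval() returns the category index 1..k
x <- findInterval(z1, breaks)
y <- findInterval(z2, breaks)

cor(z1, z2)  # correlation of the latent variables in this sample
cor(x, y)    # Pearson correlation of the categorized variables
```

Changing prob changes the thresholds, so skewed or peaked response patterns can be tried without touching the rest of the code.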

score 3 · Accepted Answer · answered Dec 12 '14 at 20:28

Technically, each variable is a draw from a multinomial distribution. This can be sampled efficiently in R with rmultinom(n, size, prob). The question remains how to choose the vector prob. A uniform distribution over the 10 categories would be prob=rep(1/10,10), but you can of course choose other shares. They should resemble the shares you are used to seeing in real Likert-scale data in your field.
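As a rough sketch of what that looks like for a single variable (assuming uniform shares over 10 categories; the variable names are made up for the example):

```r
set.seed(2)              # arbitrary seed
n    <- 100
prob <- rep(1 / 10, 10)  # assumed uniform shares over the 10 categories

# rmultinom(n, size = 1, prob) returns a 10 x n matrix of 0/1 indicators,
# one column per observation; the row holding the 1 is the drawn category
draws <- rmultinom(n, size = 1, prob = prob)
x <- apply(draws, 2, which.max)

# The same kind of draw, written more directly with sample()
x2 <- sample(1:10, n, replace = TRUE, prob = prob)

# Category counts for a whole sample of size n in a single call
counts <- rmultinom(1, size = n, prob = prob)
```

Note that this only fixes the marginal distribution of one variable; drawing x and y this way independently gives a correlation near zero.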

What you should keep in mind is that the choice of prob is closely tied to the Pearson correlations that can occur (and of course to the numerical coding of the Likert scale, but you know that). So if you follow your strategy of discarding a sample whenever it misses your target region of the Pearson correlation, the distribution of the samples you keep might be very different from the multinomial distribution you specified, as can be shown with the theory of copulas. Depending on the scope of your research, I would suggest reporting both the specified and the observed prob vectors and adding the disclaimer that your results depend on your arbitrary choice of prob. A systematic choice of prob would be a mathematician's job.
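One way to report specified and observed shares side by side is sketched below. It assumes a latent bivariate normal generator (not part of this answer), uniform specified shares, and an arbitrary acceptance window of 0.8 ± 0.02; draw_pair is a hypothetical helper written for this example.

```r
set.seed(3)   # arbitrary seed
n <- 100
k <- 10
prob   <- rep(1 / k, k)                          # specified shares (assumed uniform)
breaks <- c(-Inf, qnorm(cumsum(prob)[-k]), Inf)  # latent thresholds for these shares

# Hypothetical helper: one pair of ordinal variables cut from a
# latent bivariate normal with correlation rho
draw_pair <- function(rho) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  list(x = findInterval(z1, breaks), y = findInterval(z2, breaks))
}

# Discard samples until the Pearson correlation lands in the target region
repeat {
  s <- draw_pair(rho = 0.8)
  r <- cor(s$x, s$y)
  if (abs(r - 0.8) < 0.02) break
}

# Observed shares in the accepted sample, next to the specified ones
observed_x <- as.numeric(table(factor(s$x, levels = 1:k))) / n
observed_y <- as.numeric(table(factor(s$y, levels = 1:k))) / n
round(cbind(specified = prob, observed_x, observed_y), 2)
```

Repeating this over many accepted samples and averaging the observed shares would show whether the acceptance step shifts the marginals in a systematic way.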

Horst Grünbusch

I might not have expressed myself well enough, so I'll give it another try: rmultinom would only give me random data, but I can't control the correlation between the variables, which is my main focus. It does not have to be exactly 0.8 (~0.8 is fine). The probabilities should not refer to any field, because in my case there is none. Put simply, I want to analyze the behaviour of Pearson's r when the distances between the items of a Likert scale are not equal. Therefore I am not sure what probabilities I should assume. I guess I'll just use uniform if there is no recurring distribution for Likert scales. – jannic Dec 12 '14 at 21:05
OK, there is a formula for calculating the Pearson correlation coefficient from the copula (carrying all the dependency information) and a monotone transformation carrying the distance information you're interested in. I'm just looking it up somewhere. I know it is written in Nelsen's book on copulas. – Horst Grünbusch Dec 12 '14 at 22:21
Wow, this sounds really promising, I'll take a look into Nelsen's book. Thank you very much – jannic Dec 13 '14 at 12:52
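
The formula mentioned in the comments is not reproduced here. As a simulation sketch of the same idea, the dependence (a Gaussian copula, which is an assumption) can be held fixed while only the numerical coding of the categories changes. Both score vectors below are hypothetical examples.

```r
set.seed(4)   # arbitrary seed
n   <- 1e5    # large n so that sampling noise is small
k   <- 10
rho <- 0.85   # assumed latent correlation

# Fixed dependence structure: categorized bivariate normal, uniform marginals
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
breaks <- c(-Inf, qnorm((1:(k - 1)) / k), Inf)
x <- findInterval(z1, breaks)
y <- findInterval(z2, breaks)

# Two codings of the same 10 categories (monotone, so the order is kept)
equal_scores   <- 1:k                                        # equal distances
unequal_scores <- cumsum(c(1, 1, 1, 1, 1, 2, 4, 8, 16, 32))  # growing distances

r_equal   <- cor(equal_scores[x],   equal_scores[y])
r_unequal <- cor(unequal_scores[x], unequal_scores[y])

# Rank correlation ignores the distances, so it is the same for both codings
s_equal   <- cor(equal_scores[x],   equal_scores[y],   method = "spearman")
s_unequal <- cor(unequal_scores[x], unequal_scores[y], method = "spearman")

c(r_equal = r_equal, r_unequal = r_unequal)
c(s_equal = s_equal, s_unequal = s_unequal)
```

Swapping in other score vectors shows how sensitive Pearson's r is to the spacing, while the Spearman values serve as a check that the underlying dependence was left untouched.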
