
I'm looking for correlations between the answers to different questions in a survey ("umm, let's see if answers to question 11 correlate with those of question 78"). All answers are categorical (most of them range from "very unhappy" to "very happy"), but a few have a different set of answers. Most of them can be considered ordinal so let's consider this case here.

Since I don't have access to a commercial statistics program, I must use R.

I tried Rattle (a free data mining package for R, very nifty), but unfortunately it doesn't support categorical data. One hack I could use is to import into R the coded version of the survey, which has numbers (1..5) instead of "very unhappy" ... "very happy", and let Rattle believe they are numerical data.

I was thinking of making a scatter plot with the dot size proportional to the number of observations for each pair of answers. After some googling I found http://www.r-statistics.com/2010/04/correlation-scatter-plot-matrix-for-ordered-categorical-data/ but it seems very complicated (to me).

I'm a programmer, not a statistician, but I have done some reading on the matter and, if I understand correctly, Spearman's rho would be appropriate here.

So the short version of the question for those in a hurry: is there a way to quickly plot Spearman's rho in R? A plot is preferable to a matrix of numbers because it's easier to eyeball and can also be included in materials.

Thank you in advance.

PS I pondered for a while whether to post this on the main SO site or here. After searching both sites for R correlation, I felt this site is better suited for the question.

wishihadabettername
  • You sound like R is inferior to proprietary software. :) – Roman Luštrik Aug 25 '10 at 05:49
  • To me it sounds totally reasonable to use the Pearson product-moment correlation (assuming continuous data) in your case (assuming enough points on your scale and no "don't know" midpoint). Whole fields within psychology (e.g., personality or social psychology) rest (successfully) on the assumption that answers to a single item on, e.g., a five-point (or seven-point) scale ranging from very un-X to very X can be treated as continuous. See also this thread: http://stats.stackexchange.com/questions/539/does-it-ever-make-sense-to-treat-categorical-data-as-continuous – Henrik Aug 25 '10 at 09:59
  • @romunov: Not sure how you got the impression that I believe R is inferior to other s/w. But it's not the case at all. – wishihadabettername Aug 25 '10 at 12:45
  • I was just being a smart ass. I hope there's no hard feelings. :) – Roman Luštrik Aug 27 '10 at 10:41

2 Answers


Another good visualization of correlations is offered by the corrplot package, which gives you plots like this: [image: example corrplot output]

It is a great package.
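As a minimal sketch of how this could be combined with the Spearman's rho asked about in the question: `cor()` with `method = "spearman"` computes the rank correlation matrix, and `corrplot()` draws it. The data frame `survey` and its columns `q11`, `q78` and `q5` are made-up stand-ins for answers coded 1..5, not the asker's data.

```r
# Hypothetical survey answers coded 1..5; replace with your own data frame.
set.seed(42)
n <- 200
q11 <- sample(1:5, n, replace = TRUE)
survey <- data.frame(
  q11 = q11,
  # q78 loosely follows q11, so the two should correlate positively
  q78 = pmin(5, pmax(1, q11 + sample(-1:1, n, replace = TRUE))),
  # q5 is drawn independently of the others
  q5  = sample(1:5, n, replace = TRUE)
)

# Spearman's rho for every pair of questions (ties get average ranks).
rho <- cor(survey, method = "spearman", use = "pairwise.complete.obs")
print(round(rho, 2))

# Draw the matrix if corrplot is installed: install.packages("corrplot")
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(rho, method = "circle")
}
```

With many questions, the same two calls should work on the whole coded data frame, provided every column is numeric.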

Also have a look at the answer here; it might be useful for you to know.

Lastly, if you have suggestions for how the code in the post you referred to could be made simpler, please let me know.

Tal Galili
  • Thanks Tal, I'll try corrplot now. I also wish I knew how to simplify your solution (which I linked to in the question) but I'm just a newbie in R so you know more than me. I'll update the question to clarify the solution looks complicated *to me* – wishihadabettername Aug 25 '10 at 04:04
  • The corrplot looks good. It gives a great visual snapshot of size and direction of correlations. In the case of 5-point ordered categorical variables, it might be useful to supply some other measure of association besides Pearson's correlation: e.g., polychoric correlations. The size of standard Pearson's correlations of ordered categorical variables is influenced somewhat by the mean of the two variables. – Jeromy Anglim Aug 25 '10 at 13:38

A couple of additional plotting ideas are:
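Judging by the comments below, the ideas in this answer included jittering the points and sunflower plots. Here is a minimal sketch of both in base R, assuming two made-up questions `q11` and `q78` coded 1..5 (hypothetical data, not from the survey):

```r
# Hypothetical answers to two questions, coded 1..5.
set.seed(1)
n <- 300
q11 <- sample(1:5, n, replace = TRUE)
q78 <- pmin(5, pmax(1, q11 + sample(-1:1, n, replace = TRUE)))

op <- par(mfrow = c(1, 2))

# Jitter: add a little uniform noise so overlapping points spread out.
plot(jitter(q11, amount = 0.2), jitter(q78, amount = 0.2),
     xlab = "q11", ylab = "q78", main = "Jittered scatter plot")

# Sunflower plot: overplotted points become flowers, one petal per observation.
sunflowerplot(q11, q78, xlab = "q11", ylab = "q78", main = "Sunflower plot")

par(op)

# Counts behind the plots: how often each pair of answers occurs.
counts <- table(q11, q78)
print(counts)
```

Both plots show the raw pairs rather than a summary statistic, so they suit a single pair of questions better than a large matrix of them.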

Jeromy Anglim
  • The Sunflower is a fun solution. Using a jitter is what I tried when I first looked at the topic, but I found it not effective enough for plotting correlation matrices... – Tal Galili Aug 25 '10 at 13:26
  • Yeah, jitter could get pretty messy with a scattermatrix with lots of variables. I suppose the benefit of jitter and sunflower is that you get to see the raw data (albeit perturbed in the jitter case). – Jeromy Anglim Aug 25 '10 at 13:40
  • Agreed (I love jitter, simply not for this :) ) – Tal Galili Aug 25 '10 at 16:32