My data looks like this:

Color | X_1 X_2 ... X_n
-----------------------
red   | 0.5 0.9 ... 0.2
green | 0.7 0.7 ... 0.3
red   | 0.8 0.3 ... 0.2
blue  | 0.7 0.4 ... 0.2
...

I want to test for a correlation between the categorical variable Color and each interval variable X_i. What is the best way to calculate this (in R)?

For the sake of full disclosure, I'm trying to generate a simulation showing how easy it is to find spurious correlations when you have a small number of data points with a large number of features.
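One common way to quantify the association between a categorical variable and an interval variable is the correlation ratio (eta squared), the share of the variance in X_i explained by group membership. It is not named in the question, so treat the following as a minimal sketch rather than a definitive answer. The sample size, number of features, seed, and the helper name `eta_squared` are all assumptions made for illustration.

```r
# Simulated stand-in for the data above: few rows, many pure-noise features.
set.seed(1)      # arbitrary seed, for reproducibility only
n_obs  <- 12     # assumed number of data points
n_feat <- 200    # assumed number of features

# Balanced colour labels, generated independently of the features.
color <- factor(rep(c("red", "green", "blue"), length.out = n_obs))
X <- matrix(runif(n_obs * n_feat), nrow = n_obs,
            dimnames = list(NULL, paste0("X_", seq_len(n_feat))))

# Correlation ratio: between-group sum of squares over total sum of squares.
# (Hypothetical helper, not a base R function.)
eta_squared <- function(x, g) {
  group_means <- ave(x, g)   # each value replaced by the mean of its group
  sum((group_means - mean(x))^2) / sum((x - mean(x))^2)
}

# One value per feature, each between 0 and 1.
eta2 <- apply(X, 2, eta_squared, g = color)

# Largest apparent associations; every one is spurious by construction.
head(sort(eta2, decreasing = TRUE))
```

For a single feature this equals the R-squared of `lm(X[, i] ~ color)`, which gives an independent check on the helper.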

rhombidodecahedron
  • You have n X variables. Are you asking how to correlate any particular one with color (e.g., cor(color, x_1)), or the combination of all of them? In addition, your categorical variable is understood to have >2 levels, right? (Note that you have 2 'reds'.) – gung - Reinstate Monica Mar 27 '14 at 03:34
  • Sounds like a case for multivariate ANOVA, sort of...that is, I'd recommend a univariate ANOVA for any one interval variable, but since it sounds like you're conducting $n$ tests of the null hypothesis that color is unrelated to $X_i$, it seems appropriate to control for false alarm error inflation due solely to the number of tests. Depends partly on what kind of data you intend to simulate though. Several ways of violating MANOVA assumptions exist, and you may intend some of them. – Nick Stauner Mar 27 '14 at 03:51
  • @gung, yes, any individual one (e.g., cor(color, x_1)). And yes, there are many instances of each level of the categorical variable. – rhombidodecahedron Mar 27 '14 at 04:13
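The suggestion in the comments, a univariate ANOVA for each X_i with some control for the number of tests, might look roughly like this in R. This is a sketch under stated assumptions: the features are pure noise, the sizes and seed are arbitrary, the helper `anova_p` is hypothetical, and Holm's method is only one of several corrections that `p.adjust` offers.

```r
# Pure-noise data: colour labels carry no information about any feature.
set.seed(42)     # arbitrary seed, for reproducibility only
n_obs  <- 12     # assumed number of data points
n_feat <- 200    # assumed number of features
color <- factor(rep(c("red", "green", "blue"), length.out = n_obs))
X <- matrix(runif(n_obs * n_feat), nrow = n_obs)

# p-value of the classical one-way ANOVA F-test for one feature.
# var.equal = TRUE requests the equal-variance test rather than Welch's version.
anova_p <- function(x, g) {
  oneway.test(x ~ g, var.equal = TRUE)$p.value
}

p_raw <- apply(X, 2, anova_p, g = color)    # one test per feature
p_adj <- p.adjust(p_raw, method = "holm")   # adjust for n_feat tests

sum(p_raw < 0.05)   # "significant" features before correction
sum(p_adj < 0.05)   # "significant" features after correction
```

Comparing the two counts shows how many apparent associations come purely from running many tests on few data points.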
