
I keep seeing claims that we can't use Pearson correlation for binary variables, but I don't understand why. If instead of a binary response I had multiple (>2) categories, the problem would be obvious: we are unsure how to rank the categories. However, let's say we have a binary gender response (0 = female, 1 = male) and some continuous variable (e.g. height):

B = matrix( 
c(0,0,0,0,0,0,1,1,1,1,1,1,150,165,160,157,170,155,168,169,172,180,190,176),
nrow=12,ncol=2) 

      [,1] [,2]
 [1,]    0  150
 [2,]    0  165
 [3,]    0  160
 [4,]    0  157
 [5,]    0  170
 [6,]    0  155
 [7,]    1  168
 [8,]    1  169
 [9,]    1  172
[10,]    1  180
[11,]    1  190
[12,]    1  176

cor(B[,1], B[,2])
[1] 0.7564467

Pearson correlation gives a strong positive correlation of 0.76, which seems only logical since the men are taller in this sample. So why can't we use Pearson for variable preselection?

user3810441
  • 2
    Note that the equivalent regression fits the unique line through the two means, in your case height for males and females. That's in turn equivalent to the simplest form of a Student's t test. So, this all makes sense. The problem is with whatever unnamed sources are (or seem to be) implying otherwise. – Nick Cox Oct 22 '17 at 19:43
  • 1
    See Wikipedia's article on the [Phi coefficient](https://en.wikipedia.org/wiki/Phi_coefficient) and this question https://stats.stackexchange.com/questions/103801/is-it-meaningful-to-calculate-pearson-or-spearman-correlation-between-two-boolea ... clearly you *can* calculate it, and it is meaningful in a pretty direct sense (it's the phi coefficient -- a standard measure of association of binary variables). Really the only issue you would encounter is if you try to use the normal theory CIs and tests "as is" in small samples; I don't think they work. – Glen_b Oct 23 '17 at 02:31
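
The equivalence Nick Cox describes can be checked on the data from the question. This is only a minimal sketch: the names `sex`, `height`, `fit`, `tt` and `ct` are my own and not from the original post, and the t test assumes equal variances (`var.equal = TRUE`), since that is the version that matches the regression and the correlation test.

```r
# Data from the question: 0 = female, 1 = male
sex    <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
height <- c(150, 165, 160, 157, 170, 155, 168, 169, 172, 180, 190, 176)

# Pearson correlation between the binary and the continuous variable
r <- cor(sex, height)

# Simple regression of height on the 0/1 indicator:
# the fitted line passes through the two group means,
# so the slope is the difference between those means
fit       <- lm(height ~ sex)
slope     <- unname(coef(fit)["sex"])
mean_diff <- mean(height[sex == 1]) - mean(height[sex == 0])

# Equal-variance two-sample t test versus the test of zero correlation
tt <- t.test(height[sex == 1], height[sex == 0], var.equal = TRUE)
ct <- cor.test(sex, height)

print(r)
print(c(slope = slope, mean_diff = mean_diff))
print(c(t_from_t_test = unname(tt$statistic),
        t_from_cor_test = unname(ct$statistic)))
print(c(p_from_t_test = tt$p.value, p_from_cor_test = ct$p.value))
```

The slope and the difference in means should agree, as should the two t statistics and p-values, which is why the correlation here carries the same information as comparing the two group means.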
