I have a categorical variable measured for two samples, and I want to see whether this variable differs significantly between the samples. I want to do a chi-squared test, with the samples as the columns and the categories of the variable as the rows (or vice versa). However, I have seen some people combine the two samples and then compare one sample against this combined sample.

Is this correct? Are there any consequences of doing a sample A vs (sample A + sample B) comparison? My instinct is that you are less likely to find a significant difference between the samples, because you are comparing one sample against an average of that sample and the other.

EDIT: I'll provide an example to illustrate my confusion. Here I run A vs B:

m <- matrix(c(20, 10, 5, 10), nrow=2, dimnames=list(c("A", "B"), c("CatX", "CatY")))
m
  CatX CatY
A   20    5
B   10   10

chisq.test(m)

Pearson's Chi-squared test with Yates' continuity correction

data:  m
X-squared = 3.2512, df = 1, p-value = 0.07137

Then I combine A and B and run (A+B) vs B:

n <- matrix(c(20+10, 10, 5+10, 10), ncol=2, dimnames=list(c("A+B", "B"), c("CatX", "CatY")))
n
    CatX CatY
A+B   30   15
B     10   10
chisq.test(n)

Pearson's Chi-squared test with Yates' continuity correction

data:  n
X-squared = 0.99712, df = 1, p-value = 0.318

The p-value is much greater for (A+B) vs B than for A vs B.
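
To see whether this is specific to these counts, a small simulation can help. The following is only a sketch under assumptions: the true CatX proportions (0.8 and 0.5) and the sample sizes (25 and 20) are taken from the observed table, while the seed, the 2000 replications and the helper name `one_run` are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Assumed true CatX proportions and sample sizes, taken from the observed table
n_A <- 25; n_B <- 20
p_A <- 0.8; p_B <- 0.5

# Hypothetical helper: simulate one data set and run both comparisons on it
one_run <- function() {
  a <- rbinom(1, n_A, p_A)   # CatX count in sample A
  b <- rbinom(1, n_B, p_B)   # CatX count in sample B
  A <- c(a, n_A - a)
  B <- c(b, n_B - b)
  # suppressWarnings: small expected counts can trigger an approximation warning
  c(A_vs_B  = suppressWarnings(chisq.test(rbind(A, B))$p.value),
    AB_vs_B = suppressWarnings(chisq.test(rbind(A + B, B))$p.value))
}

p_vals <- replicate(2000, one_run())   # 2 x 2000 matrix of p-values

# Share of simulated data sets in which each comparison is significant at 5%
rej <- rowMeans(p_vals < 0.05, na.rm = TRUE)
rej
```

Under these assumptions the pooled comparison should reject no more often than the direct one, which would suggest the larger p-value above is systematic rather than a quirk of one table.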

Reading the response to this question, I found this quote:

Thus, it comes out that chi-square tests the deviation of each of the two groups profiles from this average group profile, - which is equivalent to testing the groups' profiles difference from each other, which is the z-test of proportions.

If this is the case, then surely adding B to A decreases the difference between the two and therefore leads to larger p-values.
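
Both parts of this can be checked on the counts above. This is a minimal sketch, assuming base R's `prop.test` (which applies the same continuity correction by default) is an acceptable stand-in for the test of proportions mentioned in the quote; the variable names are only illustrative.

```r
# Counts from the example above: rows are samples, columns are categories
m <- matrix(c(20, 10, 5, 10), nrow = 2,
            dimnames = list(c("A", "B"), c("CatX", "CatY")))

# Chi-squared test of A vs B (Yates' correction is the default for 2x2 tables)
chi <- chisq.test(m)

# Two-sample test of proportions on the CatX counts (same correction by default)
z <- prop.test(x = m[, "CatX"], n = rowSums(m))

# The two tests should agree on both the statistic and the p-value
all.equal(unname(chi$statistic), unname(z$statistic))
all.equal(chi$p.value, z$p.value)

# Proportion of CatX in each group: pooling A with B pulls the value towards B
p_A  <- m["A", "CatX"] / sum(m["A", ])   # 20/25
p_B  <- m["B", "CatX"] / sum(m["B", ])   # 10/20
p_AB <- sum(m[, "CatX"]) / sum(m)        # 30/45
c(A = p_A, B = p_B, "A+B" = p_AB)
```

The gap between the A+B proportion and B is smaller than the gap between A and B, which matches the intuition that pooling dilutes the difference.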

To clarify: if I want to test for independence between two categorical variables, is it ever suitable to compare one group versus a combination of the two? If so, what am I effectively doing by comparing one group to a combination of the two?

tcn
  • Logically, comparing A vs (A+B) is equivalent wrt conclusions with comparing A vs B, since A in both sides is the same. You may go on with what you were planning initially. For 2x2 frequency table the chi-square test of association is the [same](http://stats.stackexchange.com/q/173415/3277) as the z-test of two independent proportions. – ttnphns Jul 14 '16 at 19:59
  • @ttnphns That comment strikes me as potentially misleading, because this isn't a question about logic but about statistics. Statistically, comparing A to A+B is *very different* from comparing A to B: the former compares *strongly correlated* data while the latter does not. Because of that, the $\chi^2$ test simply does not apply to the former comparison. – whuber Jul 15 '16 at 15:05
  • @whuber, when I commented on the initial (short) draft of the Q, I didn't imply that the A vs (A+B) comparison would be _that same_ standard chi-square test that is applied to the A vs B comparison. Nor did I think that the OP would apply it in that "straightforward" way. Instead, by "logic" I meant that the A vs (A+B) comparison can be done mathematically (accounting for the correlated samples) to produce the p-value of the A vs B chi-square test. Note that I didn't say "A vs C comparison where C=A+B", which would imply sample A "being lost in C". – ttnphns Jul 15 '16 at 15:37
  • (cont.) So, it is only after the OP edited the Q that my initial comment became "potentially misleading" from somebody's statistical point of view. To repeat: I didn't mean the same chi-square test to be applied in the two comparisons. I didn't mean just _that_ when I was saying `go on with what you were planning initially`. – ttnphns Jul 15 '16 at 15:44
  • `If this is the case, then surely adding B to A decreases the difference between the two and therefore leads to larger p-values`. tcn, you are totally correct in observing this. The group A+B is contaminated with A, so A is compared with something which is partly A itself. Normally we would not do such things. – ttnphns Jul 15 '16 at 17:03
  • (cont.) I'd like to clarify my 1st comment and what was at the back of my mind: some programs which tabulate percentages of some responses (say, rows) by groups (columns), when requested to compare the groups, do not compare them pairwise with each other but instead compare each group with the column Total (which, of course, represents all the groups combined in one). But the comparison of a group with the total is actually performed against the total without that group in it, which is the correct way. – ttnphns Jul 15 '16 at 17:04
  • (cont.) E.g. A is compared not with (A+B+C+D) but actually with (A+B+C+D-A). That is what I implied saying `A vs (A+B) is equivalent wrt conclusions with comparing A vs B`. – ttnphns Jul 15 '16 at 17:04
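
The group-versus-total comparison described in the last comments can be sketched in R. This is only an illustration under assumptions: it reuses the counts from the question, and the helper `vs_rest` is a hypothetical name, not a function from any package or from the programs mentioned above.

```r
# Counts from the question: rows are samples, columns are categories
m <- matrix(c(20, 10, 5, 10), nrow = 2,
            dimnames = list(c("A", "B"), c("CatX", "CatY")))

# Hypothetical helper: test one group against the total with that group removed
vs_rest <- function(tab, group) {
  rest <- colSums(tab) - tab[group, ]   # column totals minus the group's own counts
  chisq.test(rbind(tab[group, ], rest))
}

# With two groups, "total minus A" is just B, so this reproduces A vs B
loo    <- vs_rest(m, "A")
direct <- chisq.test(m)
all.equal(loo$p.value, direct$p.value)

# Comparing A with the uncorrected total counts A's observations twice,
# so the chi-squared test's independence assumption no longer holds
naive <- chisq.test(rbind(m["A", ], colSums(m)))
c(direct = direct$p.value, naive = naive$p.value)
```

With more than two groups the same helper would compare each group against all the others pooled, which keeps the two rows of the table free of shared observations.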

0 Answers