I have a categorical variable measured for two samples, and I want to see whether this variable differs significantly between the samples. I want to do a chi-squared test, with the samples as the columns and the categories of the variable as the rows (or vice versa). However, I have seen some people combine the two samples and then compare one sample against the combined sample.
Is this correct? Are there any consequences of doing a sample A vs (sample A + sample B) comparison? My instinct is that you are less likely to find a significant difference between samples, because you are comparing one sample against an average of that sample and the other one.
EDIT: I'll provide an example to illustrate my confusion. Here I run A vs B:
m <- matrix(c(20, 10, 5, 10), nrow=2, dimnames=list(c("A", "B"), c("CatX", "CatY")))
m
  CatX CatY
A   20    5
B   10   10
chisq.test(m)
Pearson's Chi-squared test with Yates' continuity correction
data: m
X-squared = 3.2512, df = 1, p-value = 0.07137
Then I combine A and B and compare the combined sample against B:
n <- matrix(c(20+10, 10, 5+10, 10), ncol=2, dimnames=list(c("A+B", "B"), c("CatX", "CatY")))
n
    CatX CatY
A+B   30   15
B     10   10
chisq.test(n)
Pearson's Chi-squared test with Yates' continuity correction
data: n
X-squared = 0.99712, df = 1, p-value = 0.318
The p-value is much greater for (A+B) vs B (0.318) than for A vs B (0.07137).
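To make the comparison concrete, here is a minimal sketch that puts the row proportions of the two tables side by side. It rebuilds `m` and `n` from the example; the names `p_m`, `p_n`, `gap_m` and `gap_n` are just illustrative helpers I introduced, not part of any package.

```r
# Rebuild the two tables from the example above.
m <- matrix(c(20, 10, 5, 10), nrow = 2,
            dimnames = list(c("A", "B"), c("CatX", "CatY")))
n <- matrix(c(20 + 10, 10, 5 + 10, 10), ncol = 2,
            dimnames = list(c("A+B", "B"), c("CatX", "CatY")))

# Share of CatX within each row of each table.
p_m <- prop.table(m, margin = 1)[, "CatX"]  # rows A and B
p_n <- prop.table(n, margin = 1)[, "CatX"]  # rows A+B and B

# Gap in the CatX share between the two rows of each table.
gap_m <- unname(p_m["A"] - p_m["B"])
gap_n <- unname(p_n["A+B"] - p_n["B"])
print(c(A_vs_B = gap_m, AB_vs_B = gap_n))

# Cell totals: table n contains the observations of B in both rows.
print(c(total_m = sum(m), total_n = sum(n)))
```

If I read this correctly, the gap between the rows shrinks once B is folded into the first row, and the second table sums to more observations than were actually collected because B appears in both rows.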
Reading the response to this question, I found this quote:
Thus, it comes out that chi-square tests the deviation of each of the two groups profiles from this average group profile, - which is equivalent to testing the groups' profiles difference from each other, which is the z-test of proportions.
If this is the case, then surely adding B to A shrinks the difference between the two rows and therefore leads to larger p-values.
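As a check on the quoted equivalence, this sketch runs the A vs B table through both `chisq.test` and `prop.test` (the two-sample test of proportions in base R), leaving the continuity correction at its default in both. Treating CatX as the "success" category is my own arbitrary choice for the illustration.

```r
# A vs B table from the example above.
m <- matrix(c(20, 10, 5, 10), nrow = 2,
            dimnames = list(c("A", "B"), c("CatX", "CatY")))

# Chi-squared test on the 2x2 table (Yates' correction is the default).
chi <- chisq.test(m)

# Two-sample test of proportions: CatX counts out of each row total
# (continuity correction is also the default here).
z <- prop.test(x = m[, "CatX"], n = rowSums(m))

# Compare test statistics and p-values from the two approaches.
print(c(chisq = unname(chi$statistic), prop = unname(z$statistic)))
print(c(chisq = chi$p.value, prop = z$p.value))
```

For a 2x2 table the two calls should report the same statistic and p-value, which is how I understand the quote.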
To clarify: if I want to test for independence between two categorical variables, is it ever suitable to compare one group versus a combination of the two? If so, what am I effectively doing by comparing one group to a combination of the two?