R procedure for comparing multiple categorical variables (similar to anova() followed by t.test() for continuous)?

Question

Big Picture:

How can I implement partitioned Chi Square in R? I understand how to perform the overall Chi square, and then how to get individual parameters (observed counts, expected counts, residuals, etc.). However, I don't understand how to get p values for each individual comparison.

Details:

set.seed(200)
alpha = 0.05

I have three groups of patients and I have both categorical and continuous data collected on those patients.

Group <- c(rep('A', 'B', 'C'), 10))
Mass <- c(rnorm(10, mean = 60, sd = 1),
          rnorm(10, mean = 70, sd = 1),
          rnorm(10, mean = 80, sd = 1))
Sex <- rep(c('Male', 'Female'), 15)
data <- data.frame(Group, Mass, Sex)

I'd like to construct your basic "Table 1" showing whether these groups differ from one another with respect to these variables.

For the continuous variables, I would do an anova test. If p were less than alpha, I would do individual t.tests and interpret my p values using a Bonferroni correction.

print('Anova p-value:')
print(anova(lm(Mass ~ Group, data))$'Pr(>F)'[1])

print(t.test(Mass ~ Group, data[data$Group != 'A',])$p.value)
print(t.test(Mass ~ Group, data[data$Group != 'B',])$p.value)
print(t.test(Mass ~ Group, data[data$Group != 'C',])$p.value)

I'd like an analogous output for my discrete variables. I know that I can do a Chi square test on the data as a whole...

print(chisq.test(x = data$Group, y = data$Sex)[['p.value']])

...but I'm confused about the correct way to get and interpret individual p values. I found this question, which links to a powerpoint about "partitioning" the chi square test, but I'm having trouble following it.

Multiple Chi-Squared Tests

score 3 · Answer 1 · answered Mar 29 '18 at 20:40

As I understand this question, it is about R programming and not about statistics and thus does not qualify for this site. Since it is a relatively frequent problem, here a very simple R function that does the job.

pairwise.chisq.test <- function(x, g, p.adjust.method = p.adjust.methods, ...) {
  DNAME <- paste(deparse(substitute(x)), "and", deparse(substitute(g)))
  g <- factor(g)
  p.adjust.method <- match.arg(p.adjust.method)

  compare.levels <- function(i, j) {
    xi <- x[as.integer(g) == i]
    xj <- x[as.integer(g) == j]
    chisq.test(table(xi, xj), ...)$p.value
  }
  PVAL <- pairwise.table(compare.levels, levels(g), p.adjust.method)
  ans <- list(method = "chi-squared test", data.name = DNAME, p.value = PVAL, 
              p.adjust.method = p.adjust.method)
  class(ans) <- "pairwise.htest"
  ans
}

x: categorical "response", e.g. "Sex"
g: grouping variable, e.g. "Group"
...: Arguments passed to chisq.test

Let's try it out with your data set

# Input
Group <- rep(LETTERS[1:3], 10)
Sex <- rep(c('Male', 'Female'), 15)

pairwise.chisq.test(Sex, Group, p.adjust.method = "none")

# Output
    Pairwise comparisons using chi-squared test 

data:  Sex and Group 

  A     B    
B 0.011 -    
C 0.011 0.011

P value adjustment method: none 

# Input
pairwise.chisq.test(Sex, Group, p.adjust.method = "bonferroni")

# Output
    Pairwise comparisons using chi-squared test 

data:  Sex and Group 

  A     B    
B 0.034 -    
C 0.034 0.034

P value adjustment method: bonferroni 

# Now by simulation
set.seed(39)
pairwise.chisq.test(Sex, Group, p.adjust.method = "none", simulate.p.value = TRUE, B = 100000)

# Output

    Pairwise comparisons using chi-squared test 

data:  Sex and Group 

  A      B     
B 0.0083 -     
C 0.0080 0.0078

P value adjustment method: none

SKOR2 · Answer 2 · 2018-03-29T23:04:46.740

EDIT: I answered this question when it was worded differently - to perform a chi square test for 3 variables (2 categorical, 1 continuous)

Why are you not using ANOVA? I don't normally work with categorical data, but I thought that ANOVA was the appropriate model to use when you have categorical predictors and continuous dependent variables. I understand that Mass doesn't necessarily need to be the DV here but not how the Chi Square is supposed to cope with non-categorical data.

If you still want to proceed with Chi Square, we need to make 3 modifications: to the type of variable for Mass, to the format of data, and to the function we will use.

1. Reformatting `data` into a 3-D contingency cube

To create the contingency table, we need to make the Mass variable categorical rather than continuous. We can't compare frequencies for every point along a continuous distribution; we have to bin. I will use the dplyr package to create a 2-D matrix representation of the data and then the array function to create the 3-D contingency table we need.

Group <- rep(x = c("A", "B", "C"), times = 10)
Sex <- rep(x = c('Male', 'Female'), times = 15)
Mass <- c(rep(60,10),
          rep(70,10),
          rep(80,10))
data$Mass = as.factor(data$Mass) #important for later that Mass is not numeric
data <- data.frame(Group, Mass, Sex)
data <- data %>%
  count(Group, Sex, Mass)

This creates a matrix with 18 frequency cells (data$n) because 3 (Mass) X 3 (Group) 2 (Sex).

data
# A tibble: 18 x 4
   Group Sex    Mass      n
   <fct> <fct>  <fct> <int>
 1 A     Female 60        2
 2 A     Female 70        1
 3 A     Female 80        2
 4 A     Male   60        2
 5 A     Male   70        2
 6 A     Male   80        1
 7 B     Female 60        2
 8 B     Female 70        2
 9 B     Female 80        1
10 B     Male   60        1
11 B     Male   70        2
12 B     Male   80        2
13 C     Female 60        1
14 C     Female 70        2
15 C     Female 80        2
16 C     Male   60        2
17 C     Male   70        1
18 C     Male   80        2

Because we have 3 variables, however, we need this information in a cube, which is where array comes in.

data <- array(data = data,
            dim = c(3, 2, 3),
            dimnames = list(c("A", "B", "C"),
                            c("Male", "Female"),
                            c("60", "70", "80")))

This creates the cubic contingency table we're looking for. Broken up into 3 sides of 6 for each Mass (60,70, & 80):

, , 60

  Male Female
A    2      2
B    1      2
C    2      1

, , 70

  Male Female
A    2      1
B    2      2
C    1      2

, , 80

  Male Female
A    1      2
B    2      1
C    2      2

2. The Cochran-Mantel-Haenszel test for this 3-D data

For 3-D contingencies, Stack Overflow user mtoto recommends the Cochran-Mantel-Haenszel test.

Correct method aside, mantelhaen.test is also our only option here because chisq.test can't handle 3-D arrays.

chisq.test(data)
Error in chisq.test(data) : invalid 'x'

mantelhaen.test(data)

    Cochran-Mantel-Haenszel test

data:  data
Cochran-Mantel-Haenszel M^2 = 0, df = 2, p-value = 1

The mantelhaen.test returns the same overall result as default chisq.test. I'm not sure how a work-around would look. I suppose you could break up the 3-D array into 3 2-D contingency tables, which you could then pass to chisq.test. This would be very tricky to set up with the array indices.