Reconciling wildly different $\chi^2$ independence test results (bootstrap clustering)

Question

I have some data that should be randomly assigned to treatment $T$, and am running some tests on observables to give evidence that this is indeed the case.

Let's focus on an outcome I'll call $X$, which is discrete, so the most natural test of if $X \perp T$ is Pearson's $\chi^2$. The Pearson test (via chisq.test in R) gives a $p$-value of .008, which at first glance indicates a fair amount of correlation between $X$ and $T$. Code was specifically:

library(data.table)
analysis_data[ , chisq.test(x, treatment)$p.value]

However, there is a fair amount of meaningful clustering in the data, which is correlated with $X$ (e.g., as cluster size increases, average $X$ decreases).

To deal with this, I tried to implement a blocked bootstrap as follows:

Assign cluster ID number
For each iteration $b=1,\ldots,B$:

a. draw $n$ cluster IDs at random (with replacement, where $n$ is the total number of clusters and create a sample consisting of those $n$ clusters of observations.

b. Calculate the Pearson's $\chi^2$ statistic (via Wikipedia) within each sample via chisq.test(x, treatment)$statistic
Calculate $\tau_0$, the $\chi^2$ value in the original sample.
Calculate the $p$-value as the proportion of $\tau_b$ which exceed $\tau_0$, i.e.

$$p = \frac{1}{B}\sum_{b=1}^B \mathbb{1}[\tau_b > \tau_0]$$

(in code:

analysis_data[ , cluster_id := .GRP, by = cluster_vars]
setkey(analysis_data, cluster_id)
test_dist <- replicate(BB, analysis_data[
  .(sample(unique(cluster_id), rep = TRUE)),
  chisq.test(x, treatment)$statistic])
    t0 <- analysis_data[, chisq.test(x, treatment)$statistic]
mean(test_dist > t0)

)

When I do this, the $p$ value I get out (with BB = 10000) is .81. That strikes me as an almost unbelievably large change.

What's more, when I restrict attention to the subsample of size-1 clusters, I am met again with a large (though partially mediated) difference between the bootstrapped and the parametric result-- .28 from chisq.test vs. .88 from my bootstrap procedure.

Am I doing something wrong? What might be causing such large-magnitude changes in $p$-value between the procedures?

Reconciling wildly different $\chi^2$ independence test results (bootstrap clustering)

0 Answers0