I have some data that should be randomly assigned to treatment $T$, and am running some tests on observables to give evidence that this is indeed the case.
Let's focus on an outcome I'll call $X$, which is discrete, so the most natural test of if $X \perp T$ is Pearson's $\chi^2$. The Pearson test (via chisq.test
in R) gives a $p$-value of .008, which at first glance indicates a fair amount of correlation between $X$ and $T$. Code was specifically:
library(data.table)
analysis_data[ , chisq.test(x, treatment)$p.value]
However, there is a fair amount of meaningful clustering in the data, which is correlated with $X$ (e.g., as cluster size increases, average $X$ decreases).
To deal with this, I tried to implement a blocked bootstrap as follows:
Assign cluster ID number
For each iteration $b=1,\ldots,B$:
a. draw $n$ cluster IDs at random (with replacement, where $n$ is the total number of clusters and create a sample consisting of those $n$ clusters of observations.
b. Calculate the Pearson's $\chi^2$ statistic (via Wikipedia) within each sample via
chisq.test(x, treatment)$statistic
Calculate $\tau_0$, the $\chi^2$ value in the original sample.
Calculate the $p$-value as the proportion of $\tau_b$ which exceed $\tau_0$, i.e.
$$p = \frac{1}{B}\sum_{b=1}^B \mathbb{1}[\tau_b > \tau_0]$$
(in code:
analysis_data[ , cluster_id := .GRP, by = cluster_vars]
setkey(analysis_data, cluster_id)
test_dist <- replicate(BB, analysis_data[
.(sample(unique(cluster_id), rep = TRUE)),
chisq.test(x, treatment)$statistic])
t0 <- analysis_data[, chisq.test(x, treatment)$statistic]
mean(test_dist > t0)
)
When I do this, the $p$ value I get out (with BB
= 10000) is .81. That strikes me as an almost unbelievably large change.
What's more, when I restrict attention to the subsample of size-1 clusters, I am met again with a large (though partially mediated) difference between the bootstrapped and the parametric result-- .28 from chisq.test
vs. .88 from my bootstrap procedure.
Am I doing something wrong? What might be causing such large-magnitude changes in $p$-value between the procedures?