A colleague and I recently got into a conversation/disagreement about permutation testing (e.g. for comparing the mean value of two populations). While I thought I had a fairly good understanding of it, the fact that I cannot convince him (nor he me) makes me think that I am missing a few things. Maybe someone here could provide key insights or links that my colleague and I could go through in order to come to a common understanding.
Background/context
Let us assume we have two populations $X_1$ and $X_2$ with means $\mu_1$ and $\mu_2$, respectively. We want to perform a permutation test to check whether their means are equal or not. The null hypothesis $\mathcal{H}_0$ in this case is that $\mu_1 = \mu_2$.
Following the traditional permutation process, we:
1. Calculate the point estimate of the empirical mean difference $\hat{t}_\mu = \hat{\mu}_1 - \hat{\mu}_2$.
2. For a (preferably large) number of times, $(i)$ shuffle/randomize samples across populations, $(ii)$ compute the mean difference between the shuffled populations, and $(iii)$ bin those values into a histogram, essentially approximating the distribution of $t_\mu$ (the difference of the means) under the assumption that $\mathcal{H}_0$ is true.
3. Compute the probability of observing a value at least as extreme as $\hat{t}_\mu$ under $\mathcal{H}_0$ from the distribution obtained in Step 2; this is our p-value $p$, based on which we decide whether or not to reject $\mathcal{H}_0$ for the data at hand.
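The steps above can be sketched in Python/NumPy as follows. This is a minimal illustration only; the function name, sample sizes and number of permutations are my own choices, not part of the question:

```python
import numpy as np

def permutation_test(x1, x2, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means (minimal sketch)."""
    rng = np.random.default_rng(seed)
    observed = x1.mean() - x2.mean()        # Step 1: empirical mean difference
    pooled = np.concatenate([x1, x2])
    n1 = len(x1)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):                 # Step 2: shuffle samples across groups
        rng.shuffle(pooled)                 # and recompute the statistic each time
        perm_stats[i] = pooled[:n1].mean() - pooled[n1:].mean()
    # Step 3: p-value = fraction of permuted statistics at least as extreme
    # (in absolute value) as the observed difference
    return observed, (np.abs(perm_stats) >= abs(observed)).mean()
```

Nothing here depends on what the groups actually are: the shuffled reallocations are what generate the reference distribution.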
Point of disagreement
The point of disagreement arises in Step 2.
- From my understanding, the explicit purpose of shuffling/randomizing samples across both populations is to break whatever statistical differences might actually exist between our two populations. By doing so, we make $\mathcal{H}_0$ "virtually true" (whether it is or not for our data) through permutation/randomization/shuffling, in order to compute the distribution of $t_\mu$ under that condition. Thus the p-value obtained in Step 3 follows the exact definition of what a p-value is, i.e. the chance of observing a difference at least as extreme as our empirical one, given that $\mathcal{H}_0$ is true.
- As for my colleague's understanding, I don't know if I can do it justice, since I am not sure I understood it when we tried to discuss it (he could probably say the same about me). I will probably ask him to pitch in on this thread on Monday. However, it is fair to say that, according to him, the randomization/shuffling step is NOT carried out in order to make $\mathcal{H}_0$ "virtually true" and compute the distribution of our test statistic under the null hypothesis. I can reasonably quote him as arguing that "permutation testing works because permutation/randomization/shuffling does not break the statistical differences between the two populations if these differences actually exist".
- Edit following comments (23 Jan 2022). Here's a tentative rephrasing of what I (thought I) meant, augmented with comments from other users. What we essentially want to compute is the distribution of our test statistic $t_\mu$ under $\mathcal{H}_0$. Under $\mathcal{H}_0$, the allocation of the observations into the two groups is random, and the likelihood of observing our "observed populations" is the same as the likelihood of observing any other permuted version of the samples they contain. By carrying out a large number of permutations (each followed by the calculation of the resulting test statistic's value), we approximate what the sampling distribution of our statistic $t_\mu$ looks like under the null hypothesis. Finally, by measuring the probability of observing a value at least as extreme as $\hat{t}_\mu$ using that previously approximated distribution, we may decide to reject $\mathcal{H}_0$.
While I feel I can correctly unfold my reasoning in my head, and it seems to make logical sense (at least to me), for some reason I was completely unable to get my point across. I guess what I am looking for is a clear explanation or disproof of either one of our points of view regarding Step 2 and the role that permutation/randomization/shuffling plays in this process.
I looked online and found several lectures/PDFs, but I feel like they more or less provide the same granularity of explanation as I give in this post, without explicitly addressing the "why and how" of the permutation step.
Going a bit beyond
A further point of discussion, where we both felt a bit clueless, arises when it comes to testing something other than the classic "equality of the means". For example, let's say that I now want to test whether $\mu_1 > \mu_2$. The null hypothesis $\mathcal{H}_0$ in this case is that $\mu_1 \leq \mu_2$.
I have seen resources online saying that in that case, we can follow the procedure described in the first section of this post, and a one-tailed p-value can be obtained by looking only at the positive tail of the distribution computed in Step 2 (instead of both tails when testing equality of the means).
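The one-tailed recipe those resources describe can be sketched the same way as the two-sided procedure; the permutation scheme is identical and only the counting of the tail changes. Again a minimal illustration with my own function name and parameters:

```python
import numpy as np

def one_tailed_permutation_test(x1, x2, n_perm=10_000, seed=0):
    """Permutation p-value for the alternative mu_1 > mu_2 (minimal sketch)."""
    rng = np.random.default_rng(seed)
    observed = x1.mean() - x2.mean()
    pooled = np.concatenate([x1, x2])
    n1 = len(x1)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):                  # same shuffling as the two-sided test
        rng.shuffle(pooled)
        perm_stats[i] = pooled[:n1].mean() - pooled[n1:].mean()
    # One-tailed: count only the positive tail, i.e. permuted differences
    # that are at least as large as the observed one
    return (perm_stats >= observed).mean()
```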
What bothers me is that in this case, random permutation/shuffling across groups makes $\mu_1 = \mu_2$ "virtually true", but I feel we are "missing" all the cases where $\mu_1 < \mu_2$, so we do not properly cover the full range of $\mathcal{H}_0$ when computing the distribution of our test statistic under the assumption that $\mathcal{H}_0$ is true. In that case, shouldn't we adapt our randomization process to go further than plain random shuffling, i.e. random shuffling that (probabilistically) ensures $\mu_1 \leq \mu_2$?
I apologize for the long post, but I hope it might help other non-statisticians who want to use permutation testing someday. Any feedback, comment or link to explore would be greatly appreciated!