I have two classification models, $C_1$ and $C_2$, and I want to do a hypothesis test to see if $C_1$ is significantly better than $C_2$. I am interested in a one-sided hypothesis test. Thus I have the following hypotheses:
$H_0: \epsilon_1 = \epsilon_2$
$H_1: \epsilon_1 < \epsilon_2$
where $\epsilon_i$ are the respective error rates. I have $n$ samples from a validation set that I want to use to either accept $H_0$ or reject $H_0$. One choice for a hypothesis test that seems appropiate in this setting is the McNemar exact conditional test (I choose it because its easy to analyze and implement). I have implemented the test myself, and I'm checking to see if my implementation is correct.
Now I understand that if the $H_0$ is true, the distribution of the p-values should be uniform if the test statistical is continuous (see this question). Clearly the exact conditional test is discrete, but still, I would expect at least a sort of uniform histogram if $H_0$ is true (otherwise how can we guarantee a FPR bounded by the confidence level alpha?).
I have implemented the test, and performed two experiments. I set $\epsilon_1 = \epsilon_2 = \frac{1}{2}$, by simply making each classifier output a random coin flip, and all labels are random coin flips as well. I use either $n=100$ or $n = 10000$ for the size of the validation set. I simulate the experiment 10000 times. I observe the following histograms for the p-values:
So for small values of $n$ the distribution looks really skewed. Can we conclude that there is something wrong with my implementation of the hypothesis test, or is this to be expected due to discretization? I would assume that if the discretization is the problem, the distribution would still more or less look uniform, and not become skewed. I repeated the experiment several times and the results seem consistently skewed for small $n$.
It seems someone had a related issue. This is why I'm no also looking at the empirical CDF of the p-values. This seems a bit more well-behaved, see below the empirical CDF for $n=100$: