Fisher's method and Stouffer's approach (the z-transform) for meta-analysis roughly follow the same scheme (a code sketch of both follows the list):
1. take the p-values from several experiments on the same hypothesis;
2. apply some monotonic transformation to them;
3. sum the transformed values;
4. compare the result against some well-known distribution.
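For concreteness, here is a minimal sketch of both methods as I understand them (the function names and the example p-values are mine, and I assume independent one-sided p-values throughout):

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Fisher: T = -2 * sum(log p_i) ~ chi-squared with 2n df under H0."""
    pvals = np.asarray(pvals, dtype=float)
    t = -2.0 * np.sum(np.log(pvals))
    return stats.chi2.sf(t, df=2 * len(pvals))

def stouffer_combine(pvals):
    """Stouffer: Z = sum(Phi^{-1}(1 - p_i)) / sqrt(n) ~ N(0, 1) under H0."""
    pvals = np.asarray(pvals, dtype=float)
    z = np.sum(stats.norm.isf(pvals)) / np.sqrt(len(pvals))
    return stats.norm.sf(z)

pvals = [0.02, 0.10, 0.30]
print(fisher_combine(pvals), stouffer_combine(pvals))
```

(For what it's worth, `scipy.stats.combine_pvalues` implements both, with `method='fisher'` and `method='stouffer'`.)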
Now, my question is: what is the rationale for step 2? Under the null hypothesis, p-values are uniformly distributed on $[0,1]$, so the sum of $n$ p-values already tends to a normal distribution (and the normal approximation is fairly good even for small $n$).
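A quick simulation (my own, just to check that claim) comparing the empirical 5th percentile of the sum of $n$ uniform p-values with the normal approximation of mean $n/2$ and variance $n/12$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (3, 5, 10):
    # Sum of n independent Uniform(0,1) p-values under H0, 100k replicates.
    sums = rng.uniform(size=(100_000, n)).sum(axis=1)
    q_emp = np.quantile(sums, 0.05)
    q_norm = stats.norm.ppf(0.05, loc=n / 2, scale=np.sqrt(n / 12.0))
    print(n, round(q_emp, 3), round(q_norm, 3))
```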
This is not just a theoretical curiosity. The p-values I am working with come from Mann-Whitney tests (though many other examples could be given). There is a strictly positive probability (under the null, but also under my $H_1$) that the one-tailed p-value for a given test equals 1. If this happens, the aggregated p-value according to the z-transform method is also 1 (unless some other test yielded a p-value of 0; let us assume this is not the case), simply because the standard normal CDF reaches 1 only in the limit $x \to +\infty$, so a p-value of 1 maps to an infinite z-score that dominates the sum. And this is true however many p-values we are aggregating! As an extreme example: if one p-value is 1, the other 100 are all 0.1%, and I know that a p-value of 1 has a 10% probability of arising by chance, my intuition is that the null hypothesis should be rejected; instead, the aggregated p-value is 1.
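Here is that extreme example in code (the numbers are the ones above; the use of `scipy` is just my own sketch), showing how the single infinite z-score forces the combined p-value to 1:

```python
import numpy as np
from scipy import stats

pvals = np.array([1.0] + [0.001] * 100)
z = stats.norm.isf(pvals)            # Phi^{-1}(1 - p); the p = 1 term is -inf
combined = stats.norm.sf(z.sum() / np.sqrt(len(pvals)))
print(z[0], combined)                # -inf, and the combined p-value is 1.0
```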
What is wrong with just summing up the p-values and comparing the sum to the appropriate normal (or Irwin–Hall, or even a discrete version of it, if accuracy is the concern) distribution?!
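For comparison, the naive sum applied to the same extreme example, using the normal approximation to the Irwin–Hall distribution (again just my own sketch), gives an overwhelmingly significant result, in line with my intuition:

```python
import numpy as np
from scipy import stats

pvals = np.array([1.0] + [0.001] * 100)
n = len(pvals)
t = pvals.sum()                                    # 1 + 100 * 0.001 = 1.1
# Under H0 the sum is Irwin-Hall(n), approximately N(n/2, n/12).
combined = stats.norm.cdf(t, loc=n / 2, scale=np.sqrt(n / 12.0))
print(combined)                                    # astronomically small
```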
Even in cases where the above problem does not arise, I fail to see why extreme p-values should be given more weight; and if there is a reason, how much weight should they be given (i.e., how can the specific transformations used in step 2 be justified)?
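To make the question concrete, here is my own illustration of the weight each transformation assigns to increasingly extreme p-values: Fisher's $-2\ln p$ grows without bound, Stouffer's $\Phi^{-1}(1-p)$ grows much more slowly, and the plain sum uses the p-value itself, which is bounded below by 0 however strong the evidence.

```python
import numpy as np
from scipy import stats

for p in (0.05, 1e-3, 1e-6, 1e-9):
    print(f"p={p:>7.0e}  fisher={-2 * np.log(p):8.1f}  "
          f"stouffer={stats.norm.isf(p):6.2f}  raw={p:.0e}")
```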