
I am using Fisher's Method to combine p-values, and have noticed some strange behavior for large p-values and large $n.$

In my case I have a large number of not statistically significant results (e.g., p-values from .1 to .5), and I am using Fisher's Method to combine them. However, I noticed that Fisher's Method seems to display unstable behavior for these large p-values. For example, changing the p-values from .367 to .368 resulted in a drastic change in the combined p-value. Why is this?

p_value=fisherIntegration(rep(.367,10000000))
#p_value=1.965095e-14
p_value=fisherIntegration(rep(.368,10000000))
#p_value=0.8499356

In contrast, for low p-values and small $n,$ this behaved very nicely. For example:

p_value=fisherIntegration(rep(.05,10))
#p_value=7.341634e-06

Here is the function I use for Fisher integration:

fisherIntegration <- function(vector) {
    deg_free <- 2 * length(vector)
    y <- -2 * sum(log(vector))
    p.val <- 1 - pchisq(y, df = deg_free)
    as.numeric(p.val)
}
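(As a side note, an equivalent sketch of the same computation uses `lower.tail = FALSE`, which avoids the floating-point cancellation in `1 - pchisq()` when the upper-tail probability is tiny. This is not the cause of the behavior above, just a numerically safer form:)

```r
# Same logic as fisherIntegration above, but the upper tail is computed
# directly instead of by subtraction from 1
fisherIntegration2 <- function(p) {
  y <- -2 * sum(log(p))                           # Fisher's combined statistic
  pchisq(y, df = 2 * length(p), lower.tail = FALSE)  # combined p-value
}
```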

EDIT: This post is somewhat related, but does not address why .367 is a magic number in this context: Why does Fisher's method yield $p\gg 0.5$ when combining several p-values all equal to $0.5$?

amoeba
Josh
  • Have you noticed that $0.367\lt e^{-1} \lt 0.368$? (That would be the only point of an exercise that purports to combine $10^7$ p-values in this fashion: it has no statistical use.) – whuber Apr 13 '18 at 16:05
  • I didn't notice that. I'll bet that this has something to do with the weird behavior, but I am not sure why. – Josh Apr 13 '18 at 16:38
  • From the other direction, what's the mean of the chi-square distribution? – Scortchi - Reinstate Monica Apr 13 '18 at 17:03
  • I think you may find this Q&A interesting, especially Christoph Hanck's answer: https://stats.stackexchange.com/questions/243003/can-a-meta-analysis-of-studies-which-are-all-not-statistically-signficant-lead – mdewey Apr 13 '18 at 17:05

1 Answer


As explained at https://stats.stackexchange.com/a/314739/919, Fisher's Method combines p-values $p_1, p_2, \ldots, p_n$ under the assumption they arise independently under null hypotheses with continuous test statistics. This means each is independently distributed uniformly between $0$ and $1.$ A simple calculation establishes that $-2\log(p_i)$ has a $\chi^2(2)$ distribution, whence

$$P = \sum_{i=1}^n -2\log(p_i)$$

has a $\chi^2(2n)$ distribution. For large $n,$ the Central Limit Theorem implies this distribution is approximately Normal. It has a mean of $2n$ and a variance of $4n,$ as we may readily calculate.
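This Normal approximation is easy to check numerically; here is a quick sketch (the choice $n = 10^6$ is illustrative), comparing the exact $\chi^2(2n)$ upper-tail probability to that of a $N(2n, 4n)$ distribution at a point three standard deviations above the mean:

```r
n <- 1e6
x <- 2 * n + 3 * sqrt(4 * n)  # three SDs above the mean 2n

# Exact chi-squared upper tail vs. its Normal approximation
pchisq(x, df = 2 * n, lower.tail = FALSE)
pnorm(x, mean = 2 * n, sd = sqrt(4 * n), lower.tail = FALSE)
# the two upper-tail probabilities agree closely for n this large
```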

Suppose, now, that $P$ is "much" different from this mean. "Much" means, as usual, in comparison to the standard deviation: that is, suppose $P$ differs from $2n$ by more than a few multiples of $\sqrt{4n}=2\sqrt{n}.$ From basic facts about Normal distributions, this implies $P$ is either unusually small or unusually large. Consequently, as $P$ ranges from $2n-2K\sqrt{n}$ to $2n+2K\sqrt{n}$ for $K \approx 3,$ the cumulative probability of $P$ sweeps from nearly $0$ to nearly $1,$ and the combined p-value (the upper-tail probability) correspondingly sweeps from nearly $1$ to nearly $0.$

In other words, all of the "interesting" probability for $P$ occurs within the interval $(2n-2K\sqrt{n}, 2n+2K\sqrt{n})$ for small $K$. As $n$ grows, this interval narrows relative to its center (at $2n$).
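A quick sketch makes this narrowing concrete: sweeping $P$ across the interval $(2n-2K\sqrt{n},\ 2n+2K\sqrt{n})$ for the question's $n = 10^7$ and $K = 3,$ the upper-tail (combined) p-value runs from nearly $1$ down to nearly $0$:

```r
n <- 1e7
K <- 3

# Seven equally spaced values of P spanning 2n - 2K*sqrt(n) to 2n + 2K*sqrt(n)
P <- seq(2 * n - 2 * K * sqrt(n), 2 * n + 2 * K * sqrt(n), length.out = 7)

# Combined p-values (upper tail of chi^2(2n)) across this narrow interval
round(pchisq(P, df = 2 * n, lower.tail = FALSE), 4)
# runs from nearly 1 at the left endpoint to nearly 0 at the right endpoint
```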

One conclusion we can draw from this result is that when $\sqrt{n}$ is large enough to dominate $2K$ (that is, when $n$ is much larger than $(2\times 3)^2\approx 40$ or so), Fisher's Method may be reaching the limits of its usefulness.


In the circumstances of the question, $n=10^7.$ The interesting interval for the average log p-value, $-P/(2n),$ therefore is roughly

$$-(2n+2K\sqrt{n},\ 2n-2K\sqrt{n})/(2n) \approx (-1.000949,\ -0.999051)$$

when $K=3.$

The corresponding geometric mean p-values are

$$e^{-0.999051} = 0.368229\text { and } e^{-1.00095} = 0.367531.$$
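These endpoints can be verified directly (a quick sketch, using the question's $n = 10^7$ and $K = 3$):

```r
n <- 1e7
K <- 3

# Endpoints of the interesting interval for the average log p-value, -P/(2n)
avg_log_p <- -c(2 * n + 2 * K * sqrt(n), 2 * n - 2 * K * sqrt(n)) / (2 * n)
avg_log_p       # approximately -1.000949 and -0.999051

# Corresponding geometric-mean p-values
exp(avg_log_p)  # approximately 0.367531 and 0.368229
```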

The lower value of $0.367$ used in the question falls outside this interval, yielding a combined p-value of essentially zero, while the upper value of $0.368$ lies within it, yielding a combined p-value that is still appreciably less than $1.$ This is an extreme instance of our previous conclusion, which can be restated like this:

When the average natural logarithm of the p-values differs much from $-1,$ Fisher's Method will produce a combined p-value extremely near $0$ or near $1$. "Much" is proportional to $1/\sqrt{2n}.$
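As a sketch, the question's two results can be reproduced directly from the $\chi^2(2n)$ upper-tail probability, showing the jump across $e^{-1} \approx 0.3679$:

```r
n <- 1e7

# Fisher's combined p-value for 10^7 copies of .367 and of .368:
# the statistic is -2 * sum(log(p)) = -2 * n * log(p) when all p are equal
pchisq(-2 * n * log(0.367), df = 2 * n, lower.tail = FALSE)  # essentially 0
pchisq(-2 * n * log(0.368), df = 2 * n, lower.tail = FALSE)  # approximately 0.85
```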

whuber
  • Based on this answer, would you argue that Stouffer integration is more appropriate in cases of large $n$? – Josh Apr 13 '18 at 19:16
  • I believe that since such a huge amount of information is discarded in combining large numbers of p-values, and because the result with large $n$ is sensitive to the assumption of independence (which rarely truly holds), *no* method of combining them into a single decision is suitable in most circumstances. Stouffer's method scarcely differs from Fisher's method anyway. – whuber Apr 14 '18 at 17:17
  • I don't agree, in that at least Stouffer integration does not display this strange "threshold" behavior. As far as I can tell, passing a vector of z-scores consistently above 0 (e.g., 1000 z-scores equal to 0.5) will always produce a final z-score above the original, which is logical. Fisher's method here is, in my mind, a 'bug'. – Josh Apr 17 '18 at 14:35
  • Whatever the differences might be, neither method was either intended for nor is useful for combining millions of p-values. In their areas of useful application they tend not to differ much. There's no "bug" in Fisher's approach: it's perfectly accurate, given its assumptions and its objective. Stouffer's is a little *ad hoc,* implicitly relying on additional assumptions. To be more constructive: when you have lots of (independent) p-values, you will get far more information out of them by studying how their distribution departs from uniformity than you will from any single combined statistic. – whuber Apr 17 '18 at 14:41
  • Ok. I don't really agree with you regarding Fisher's method. Similar to the concrete example we discussed "fisherIntegration(rep(.367,1000))=.4999" but "fisherIntegration(rep(.367,10000000))=1.965095e-14" is intuitively silly. Any method can be justified given its assumptions/objectives, but in this case this kind of threshold dependent behavior would not fit what most users would find reasonable. Of course, I agree with you that a single summary statistic will be worse than more carefully examining the distribution. – Josh Apr 17 '18 at 15:06
  • There's nothing to disagree about concerning the mathematical validity of Fisher's Method. When mathematics conflicts with intuition, it's almost always time to modify one's intuition: that's how understanding grows. In this case I suspect what your intuition might really be telling you is that it's practically useless to assume that ten million p-values are all independent and come from a perfectly uniform distribution. – whuber Apr 17 '18 at 15:33