10

It seems like a fairly straightforward question, but when I really think about it, Stouffer's method doesn't make sense to me. This is why:

Assume a two-tailed hypothesis. You first calculate $z_i$ from the $p$-values. Take a fairly simple example: two $p$-values of $0.05$, which means that $z_1$ and $z_2$ are both $\approx 1.96$. According to Stouffer's method, $z_1$ and $z_2$ are combined such that:

$$ Z = \frac{\sum\limits_{i=1}^k z_i}{\sqrt{k}} = \frac{1.96 + 1.96}{\sqrt{2}} \approx 2.77 $$

This $z$-score is then converted back to a $p$-value, giving roughly $0.006$, whereas the $p$-value associated with each $z_i$ individually is about $0.05$.
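In R, this calculation looks like the following minimal sketch (the numbers and the two-tailed conversion are the ones from the example above, not any particular library's implementation):

```r
p <- c(0.05, 0.05)
z <- qnorm(1 - p / 2)          # two-tailed p-values to z-scores, both ~1.96
Z <- sum(z) / sqrt(length(z))  # Stouffer's combined statistic, ~2.77
2 * (1 - pnorm(Z))             # converted back to a two-tailed p-value, ~0.006
```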

In this sense, it seems as though Stouffer's test artificially changes the resultant $p$-value to a value dissimilar to the $p$-values of each $z_i$, which, to me, doesn't make sense.

Am I misunderstanding this test or can someone help me understand how / why it works?

amoeba
will
    (+1) But please note that Stouffer's method in this form is not appropriate for two-tailed alternatives. The problem is that it overlooks the possibility that one study might have found an effect in one direction and the other, an effect in the opposite direction. One has to check that this has not occurred. To get to your question: in what sense is this "artificial"? Bear in mind the purpose is to *combine evidence* to support decision making. Doesn't it make sense that two significant results ought to constitute stronger support for a decision than either one alone? – whuber Jul 28 '15 at 20:08
  • When I wrote that it seems "artificial," I meant that in the case that there are two samples (N = 2), there will always be an inflation of the Z-score, resulting in consistently lower p-values than expected from either z-score ($z_i$). While it does make sense that two significant results should provide stronger support for a decision than either one alone, it doesn't make sense for two p-values to be fed into Stouffer's method and the result to be completely different from either p-value. – will Jul 28 '15 at 20:17
  • @whuber: your comment perfectly answers the question (+1). Maybe you could copy paste it to an answer so I can remove mine (quickly written on mobile.) – Michael M Jul 28 '15 at 20:23
  • @MichaelM, I'm sorry, but I don't quite understand how the comment answers the question. It still doesn't make sense to me that the p-value would be deflated by Stouffer's method. – will Jul 28 '15 at 20:26
  • @will, I cannot understand the last sentence of your first (long) comment here. Yes, it does make sense that two significant results yield stronger support when combined. Which means that the combined p-value can well be lower than either of the two. So what's the problem? – amoeba Jul 28 '15 at 20:40
  • @Michael Thank you for your encouragement, but I feel my comment is too vague and qualitative to be suitable as a full answer. Among other things, a good answer would explain why this method of combining p-values makes any (quantitative) sense. Based on the comments that are accumulating, it also looks like some explanation of what p-values mean might also be helpful. – whuber Jul 28 '15 at 20:43
  • @amoeba Oh, I see now. I was expecting that Stouffer's method would return a p-value similar to the p-values associated with each $z_i$. However, what if the two results are insignificant? Take two p-values of 0.95, which correspond to z-scores of about 0.063. $$Z = \frac{z_1+z_2}{\sqrt{2}} = \frac{0.063+0.063}{\sqrt{2}} \approx 0.089$$ This z-score is associated with a p-value of roughly 0.93, which indicates that these two insignificant p-values, when combined through Stouffer's method, give stronger support than either one individually. Does this also make sense? Less significant -> more – will Jul 28 '15 at 20:53
  • @whuber I understand your first point now. However, it still doesn't make sense to me that two insignificant p-values, when combined through Stouffer's method, result in a more significant p-value. Combining higher p-values gives a lower p-value, i.e. higher significance / lower probability of obtaining the results. – will Jul 28 '15 at 21:03
  • I was thinking that one way to develop your intuition would be to reverse this procedure: take a single study and *split* it into two random parts, then analyze each part separately. As a very simple example, consider a post-election survey in which 1000 people were polled and 535 said they voted for the incumbent and 465 for her opponent. A random split might go 265-235 in one half and 270-230 in the other half. What are the p-values for the test of equality of proportions in the two halves and what is the p-value overall? (In `R`, compute using `prop.test(535,1000)`, etc.) – whuber Jul 28 '15 at 21:44
  • Hmm, not sure if I understand what you're getting at, but the p-values for the test of equality of proportions in each half are 0.53 and 0.54, respectively. The overall p-value is 0.535. – will Jul 28 '15 at 22:31
  • Related: http://stats.stackexchange.com/q/20126/6432 – krlmlr Aug 21 '15 at 17:17
  • You seem to confuse the sample estimate of the proportion with the p-value of the test!! The overall p-value is 0.03 while the p-values of the two halves are 0.08 and 0.19. – whuber Oct 15 '18 at 18:51
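Whuber's split-study example can be checked directly in R; combining the two halves with Stouffer's method afterwards is an extra step added here for illustration (reasonable in this case because both halves point in the same direction):

```r
# p-values for the test of equal proportions (H0: proportion = 0.5)
p_all   <- prop.test(535, 1000)$p.value  # whole study, ~0.03
p_half1 <- prop.test(265,  500)$p.value  # first random half, ~0.19
p_half2 <- prop.test(270,  500)$p.value  # second random half, ~0.08

# Two-tailed Stouffer combination of the two halves
z <- qnorm(1 - c(p_half1, p_half2) / 2)
2 * (1 - pnorm(sum(z) / sqrt(2)))        # ~0.03, close to the whole-study p-value
```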

4 Answers

8

The higher overall sample size leads to higher power and thus to a smaller p-value (at least if the working hypothesis is supported by the data).

This is usually the main point of any meta-analysis: multiple pieces of weak evidence supporting a hypothesis are combined into strong evidence for it.
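A rough simulation sketch of this power argument, assuming (arbitrarily) a one-sided one-sample t-test, n = 20 per study, and a true effect of half a standard deviation:

```r
set.seed(1)
n <- 20; delta <- 0.5; nsim <- 10000
one_p <- function() t.test(rnorm(n, mean = delta), alternative = "greater")$p.value
p1 <- replicate(nsim, one_p())   # p-values from "study 1"
p2 <- replicate(nsim, one_p())   # p-values from "study 2"
p_comb <- 1 - pnorm((qnorm(1 - p1) + qnorm(1 - p2)) / sqrt(2))  # one-sided Stouffer
mean(p1 < 0.05)      # power of a single study (~0.7 here)
mean(p_comb < 0.05)  # power of the combined analysis, noticeably higher (~0.9)
```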

Michael M
  • Since the statistical term "power" in this context has a sharply different meaning from "p-value", I am concerned that this explanation could cause some confusion between them. – whuber Jul 28 '15 at 20:10
  • So does this mean that when the sample size is 2, the power of Stouffer's method will always be lowered, and that the p-value will always be smaller? How can I get a more accurate answer when the sample size is two? – will Jul 28 '15 at 20:19
  • The "meta sample" size is two, i.e. there were two experiments who both yielded $p=0.05$. The combined sample size $N$ is $N=N_1+N_2$, so typically much larger than 2. Since this meta analysis only takes into account the p values, the information available is much lower than from the raw data of the $N_1+N_2$ events. – quazgar Oct 15 '18 at 17:44
2

For simplicity, think in terms of a test on means. Suppose that under $H_0$ the treatment effect is zero, so that each $z$-value is a weighted estimate of the treatment effect $\theta_i$. Stouffer's method takes an unweighted average of these treatment effects, so it gives a more precise estimate (and hence a smaller $p$-value) than each separate $z$-value. This unweighted estimate of the treatment effect is biased, but a weighted version of Stouffer's method is possible, and if the weights are proportional to $1/\mathrm{SE}(\theta_i)$ the treatment-effect estimate is unbiased. This only makes sense, however, if the separate $z$-values measure the same quantity.

An advantage of Stouffer's and Fisher's methods is that they can also be applied to meta-analyses where different response variables have been chosen - so they can't be averaged - but where a consistent direction of effect can still be discerned.
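A minimal R sketch of the weighted form described above, with made-up z-values and standard errors purely for illustration:

```r
z  <- c(1.5, 2.2, 0.8)     # per-study z-values (hypothetical)
se <- c(0.40, 0.25, 0.60)  # per-study standard errors of the effect estimates (hypothetical)
w  <- 1 / se               # weights proportional to 1/SE, as suggested above
Z_w <- sum(w * z) / sqrt(sum(w^2))  # weighted Stouffer statistic, N(0,1) under H0
2 * (1 - pnorm(abs(Z_w)))           # combined two-sided p-value
```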

0

Think of it from a meta-analysis point of view: if there were no effect ($H_0$), $p$-values would be uniformly distributed between 0 and 1. So if you get $p<0.1$ in more than 10% of all single analyses (potentially many of them), this can amount to the conclusion that $H_0$ probably should be rejected.
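A quick R sketch of that uniformity claim (the simulation settings are arbitrary):

```r
set.seed(1)
p <- replicate(10000, t.test(rnorm(30))$p.value)  # H0 is true: the mean really is 0
hist(p)          # roughly flat between 0 and 1
mean(p < 0.1)    # close to 0.10, as expected under H0
```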

I do not even see a problem for two-tailed tests: in this case the result should be interpreted as "it is unlikely that the true mean is 0 (in the example of a Gaussian centered at 0), but I cannot tell (from either the individual or the combined $p$-values) whether the true mean is above or below it."

quazgar
-2

I think it would be fine to combine two-tailed results, because that means the result would amount to zero: if there is evidence that the treatment helps the patient [right tail] but also evidence that it worsens the disease [left tail], the net result is no evidence towards a particular hypothesis, since the two cancel out and more observations are needed.
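A small R sketch of the cancellation being described, using two hypothetical studies with equally strong effects in opposite directions (this is also the situation whuber's first comment warns about for the two-tailed form):

```r
z <- c(qnorm(1 - 0.025), qnorm(0.025))  # ~ +1.96 and -1.96: opposite directions
Z <- sum(z) / sqrt(length(z))           # ~ 0: the signed z-values cancel
2 * (1 - pnorm(abs(Z)))                 # combined two-tailed p-value ~ 1
```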

gah
  • I do not think that this addresses the question. Also, whuber's comment indicates that this particular method does not work for 2-tailed tests. – mkt Nov 19 '17 at 22:40