Why is Stouffer's method so often performed on $z$'s that correspond to one-tailed $p$-values, when the mathematics allows for $z$'s that correspond to two-tailed $p$-values?

- Joel, I've noticed you've asked several questions (and gotten several answers with upvotes) but never accepted an answer. Consider upvoting and/or accepting answers you've found helpful. – Macro Jan 17 '12 at 14:45
- Can you clarify your question a little bit, maybe even providing an example (or a reference to one) that you're thinking of? In particular, it's not clear what you mean by "one-tailed $p$-values" and "two-tailed $p$-values" here since Stouffer's method may be combining $p$-values for tests that have nothing to do with $t$-tests. – cardinal Jan 17 '12 at 15:09
- @Joel: I was wondering if you might be interested in an alternate answer to your question. My (perhaps overly earnest) hope was that the current answer would be updated, but that doesn't appear to have happened yet. – cardinal Feb 05 '12 at 19:42
1 Answer
Suppose the null hypothesis $\mu = 0$ is considered in two $2$-tailed studies. Suppose that one study rejects the null hypothesis because all data are strongly positive (supporting the alternative hypothesis $\mu > 0$ as well as the alternative hypothesis $\mu \neq 0$), while the other study rejects the null because all the data are strongly negative (supporting the alternative hypothesis $\mu < 0$ as well as the alternative hypothesis $\mu \neq 0$). Clearly, if the data from the two studies were combined, the null hypothesis would be rejected because all the data differ significantly from $0$ and thus support the alternative hypothesis $\mu \neq 0$ corresponding to a $2$-tailed study.

However, the $z$'s from the studies will be positive and negative respectively, and Stouffer's method will add the two $z$-scores to get a combined value close to $0$, and thus conclude that the null hypothesis should not be rejected at all.
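As a minimal sketch of the cancellation just described (in R, using hypothetical summary $z$-scores rather than data from any actual study):

```r
## Stouffer's combined statistic: Z = sum(z_i) / sqrt(k)
stouffer <- function(z) sum(z) / sqrt(length(z))

## Hypothetical signed z-scores: study 1 strongly positive, study 2 strongly negative
z <- c(3.2, -3.1)

Z <- stouffer(z)   # about 0.07
1 - pnorm(Z)       # combined one-sided p-value, about 0.47: no rejection
```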

- I think there may be either (a) a misunderstanding of the OP's question and/or (b) a misunderstanding of what Stouffer's method is doing here. In particular, if you are combining two $p$-values from two-tailed tests using Stouffer's method, you would get a *highly positive* value in the example you provide since $\Phi^{-1}(1-p_i)$ would be large in both cases. – cardinal Jan 17 '12 at 15:14
- On the other hand, it is not clear that "if the data from the two studies were combined, the null hypothesis would be rejected". As a slight application of *reductio ad absurdum*, suppose $X_i = -c \quad\forall i$ in study 1 and $Y_i = c \quad\forall i$ in study 2 where the sample sizes were the same. – cardinal Jan 17 '12 at 15:14
- @cardinal The $p$-value is the probability of obtaining a result as extreme as or more extreme than the one observed assuming the null is true. Thus, if the data are strongly positive, $$p=P\{\text{mean more positive}\mid H_0\}=1-\Phi(z),$$ while if data are strongly negative, $$p=P\{\text{mean more negative}\mid H_0\}=\Phi(z).$$ As I understand it, Stouffer's method sums $z$-scores instead of combining $p$-values using logarithms as in Fisher's method. Since $p=1-\Phi(z)$ with $z>0$ in one case and $p=\Phi(z)$ with $z<0$ in the other, the sum could be small, leading to the null $\mu=0$ not being rejected. – Dilip Sarwate Jan 17 '12 at 16:08
- Maybe I am misunderstanding your answer then, which discusses *two-tailed studies*, which I interpreted as using a $p$-value from a two-tailed test, which would involve the absolute value of the respective means. – cardinal Jan 17 '12 at 16:16
- (-1) **Very temporarily**. (Will be removed upon clarification.) I am very hesitant to downvote here, as it may simply be I've got myself mixed up (in which case, please accept my apologies). But, I think some things need to be clarified here. – cardinal Jan 17 '12 at 16:30
- First of all, if we're discussing two-tailed tests, then our $p$-value is based on $p = \mathbb P(|T| > t_{1-\alpha/2})$, say. Hence, anything far away from zero will give a *small* $p$-value. The $p$-value encodes nothing about the sign of $T$. Thus, when "backtransforming" this $p$-value in the Stouffer's test, we would get a *large* value of $Z_i = \Phi^{-1}(1-p_i)$ regardless of the sign of $T_i$. .../... – cardinal Jan 17 '12 at 16:47
- .../... So, under this interpretation of your example, the Stouffer statistic would be quite large and likely to lead to a rejection. On the other hand, if we do a *one-tailed test* (say of $\mu > 0$), then the Stouffer test will give a smaller value. So, in some sense, I believe the logic in the answer is reversed. I think we have to be a little careful about claiming any monotonicity results (which I haven't checked) since a particular test statistic value will get a more extreme $p$-value in the one-sided case compared to the two-sided one, and this is a nonlinear relationship. – cardinal Jan 17 '12 at 16:49
- Maybe it boils down to what is meant by $z$ and how it relates to $p$. You are using absolute values; I am not. If Stouffer's method sums absolute values, then no problem; if it sums signed values, then there is a problem as indicated in my answer. In meta-analysis of one-sided tests $\mu \leq 0$ (or the traditional $\mu = 0$) versus $\mu > 0$, $z$ is always positive and so maybe nobody mentions that absolute values are to be summed because there is no need to. If so, one should not blindly apply Stouffer's method in $2$-sided tests and sum $z$ scores if $z$ scores can be positive or negative. – Dilip Sarwate Jan 17 '12 at 17:06
- Thank you for discussing this with me. Obviously, this medium makes it a little difficult. $Z_i$ is *not* always positive in a meta-analysis based on one-sided tests. It may *normally* be positive since there is probable reporting bias (i.e., results that show statistical significance are more likely to be published). But, there is nothing *a priori* making $Z_i$ positive. Think of this as a "Markov chain" $$T \to p \to Z \> .$$ We get from $p$ to $Z$ via the function $\Phi^{-1}(1-p)$. How we get from $T$ to $p$ depends on whether we use a one-sided or two-sided test.../... – cardinal Jan 17 '12 at 17:44
- .../...If we use a two-sided test, then a $T$ value of, say, 2.5 or -2.5 gets the same (small!) $p$-value. This results in the same (positive, large!) Stouffer $Z$ value. If, instead, we use a one-sided test (say against $\mu > 0$), then the $p$-value is small in the first case and large in the second. Hence in the *one-sided* case (not the two-sided one), we would get one positive $Z$ and one negative $Z$. – cardinal Jan 17 '12 at 17:47
- Going back to your first comment, I think what may be happening is you are conflating the original test-statistics with the "inferred" ones from the $p$-values. In your comment where you talk about $p$-values, note that the $p$-value is small in both cases, so the Stouffer $Z = \Phi^{-1}(1-p)$ will be positive, and large, in both cases (assuming the test statistic is far away from zero, of course). :) – cardinal Jan 17 '12 at 17:50 (see the sketch after these comments)
- I think @cardinal is correct, Dilip. Most of your reply looks good, but this part seems mistaken: "However, the z's from the studies will be positive and negative respectively." This presumes the z's would be calculated as if the studies performed *one*-tailed tests. With a correct two-tailed test, the z's will both be positive, not "positive and negative respectively" as you claim. There's still some thinking to do here: although Stouffer's method will conclude there is a significant difference, it cannot tell us in what direction! – whuber Jan 17 '12 at 19:46
- I have a concern with the first part of the answer: in one sample, we have most $z_i\ll0$, in a second sample, most $z_i\gg0$. Then "if the data from the two studies were combined, the null hypothesis would be rejected because all the data differ significantly from 0 and thus support the alternative hypothesis $\mu\ne0$". I don't agree. First sample $-65.8,-82.5,-1.6,-19.9,-95.0,-83.1$, two-sided $p=0.01$; second sample $37.6,43.3,56.1,66.6,58.8,89.2,79.2,77.2,144.3$, $p=0.0001$; pooled data, $p=0.3$. (data generated with `rnorm`) – Elvis Jan 17 '12 at 20:17
- However, I think that in meta-analyses the $p$-value can be used together with the reported direction of the effect to obtain signed z's. But I also think that if you have discordant effect directions between studies, the sample size should be used as well in the meta-analysis... – Elvis Jan 17 '12 at 20:19
- I hope you will have the opportunity to update your answer at some point. I think it has potential to be quite informative, which is one reason I've taken an interest in it. Please do not misconstrue my rather copious comments as some form of discontent; rather, I hope you can see them simply as an expression of interest and a desire to gain some further insight into this procedure. Cheers. :) – cardinal Jan 18 '12 at 00:47
- Would you care to discuss this in chat sometime, perhaps this weekend? I would be both interested and willing. Cheers. :) – cardinal Jan 21 '12 at 03:52
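To make the point in the comments above concrete, here is a small R sketch (with hypothetical test statistics, not data from any of the studies discussed) contrasting how the back-transformation $Z_i = \Phi^{-1}(1-p_i)$ behaves when the $p_i$ come from two-sided versus one-sided tests:

```r
t_obs <- c(2.5, -2.5)             # hypothetical test statistics of opposite sign

## Two-sided p-values: both small, so both back-transformed Z's are large and positive
p_two <- 2 * pnorm(-abs(t_obs))   # ~0.012 and ~0.012
z_two <- qnorm(1 - p_two)         # ~2.24 and ~2.24

## One-sided p-values (alternative mu > 0): one small, one large
p_one <- pnorm(-t_obs)            # ~0.006 and ~0.994
z_one <- qnorm(1 - p_one)         # 2.5 and -2.5: the signs survive

## Stouffer combination Z = sum(z_i)/sqrt(k) of each set
sum(z_two) / sqrt(2)              # ~3.17: rejection, but the direction is lost
sum(z_one) / sqrt(2)              # 0: no rejection, the situation described in the answer
```

Under the two-sided back-transformation both studies contribute large positive $Z$'s, so the combined statistic is large but says nothing about direction; under the one-sided back-transformation the opposite signs are preserved and cancel.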