
My situation is as follows: I want, through a Monte-Carlo study, to compare the $p$-values of two different tests for the statistical significance of an estimated parameter (the null is "no effect: the parameter is zero", and the implied alternative is "the parameter is not zero"). Test A is the standard "independent two-sample t-test for equality of means", assuming equal variances.

Test B I have constructed myself; here the null distribution used is a generic, asymmetric discrete distribution. But I have found the following comment in Rohatgi & Saleh (2001, 2nd ed., p. 462):

"If the distribution is not symmetric, the $p$-value is not well defined in the two-sided case, although many authors recommend doubling the one-sided $p$-value".

The authors do not discuss this further, nor do they comment on the "many authors" suggestion to double the one-sided $p$-value. (This raises the question: double the $p$-value of which side? And why this side and not the other?)

I was not able to find any other comment, opinion or result on this whole matter. I understand that with an asymmetric distribution, although we can consider an interval symmetric around the null-hypothesis value of the parameter, we will not have the second usual symmetry, that of the allocation of probability mass. But I do not understand why this makes the $p$-value "not well defined". Personally, by using an interval symmetric around the null hypothesis for the values of the estimator, I see no definitional problem in saying "the probability that the null distribution will produce values equal to the boundaries of, or outside, this interval is XX". The fact that the probability mass on one side will differ from the probability mass on the other side does not appear to cause trouble, at least for my purposes. But it is rather more probable than not that Rohatgi & Saleh know something that I don't.
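For concreteness, here is a minimal sketch (in Python; the asymmetric discrete null distribution and the observed estimate are made up purely for illustration) of the computation I have in mind:

```python
import numpy as np

# Hypothetical asymmetric discrete null distribution of the estimator
# (values and probabilities are purely illustrative).
values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
probs  = np.array([0.02, 0.08, 0.40, 0.30, 0.12, 0.06, 0.02])

theta_hat = 2.0   # observed estimate; the null value is 0

# "Symmetric interval around the null" p-value: total null probability
# of values on or outside the boundaries [-|theta_hat|, +|theta_hat|].
p_value = probs[np.abs(values) >= abs(theta_hat)].sum()
print(p_value)    # here: P(-2) + P(2) + P(3) + P(4) = 0.22
```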

So this is my question: in what sense is the $p$-value (or can it be) "not well defined" in the case of a two-sided test when the null distribution is not symmetric?

A perhaps important note: I approach the matter more in a Fisherian spirit; I am not trying to obtain a strict decision rule in the Neyman-Pearson sense. I leave it up to the user of the test to use the $p$-value information, alongside any other information, to make inferences.

Alecos Papadopoulos
• In addition to the likelihood-based ("Fisherian") and LR-based (N-P) approaches, another method considers how to obtain *short* confidence intervals and uses those for hypothesis testing. This is done in the spirit of decision theory (and using its methods), where length is included within the loss function. For unimodal symmetric distributions of the test statistic, obviously the shortest possible intervals are obtained using symmetric intervals (essentially "doubling the p-value" of one-sided tests). Shortest-length intervals depend on the parameterization: thus they cannot be Fisherian. – whuber Mar 06 '15 at 16:30
• I was wondering if the answers posted here would also be applicable to beta distributions. Thanks. – JLT Jul 13 '17 at 10:40

2 Answers


If we look at the 2x2 exact test, and take that to be our approach, what's "more extreme" might be directly measured by 'lower likelihood'. (Agresti[1] mentions a number of approaches by various authors to computing two-tailed p-values just for this case of the 2x2 Fisher exact test, of which this approach is one of the three specifically discussed as 'most popular'.)

For a continuous (unimodal) distribution, you just find the point in the other tail with the same density as your sample value, and everything with equal or lower likelihood in the other tail is counted in your computation of the p-value.

For discrete distributions which are monotonically nonincreasing in the tails, it's just about as simple. You just count everything with equal or lower likelihood than your sample, which, given the assumptions I added (to make the term "tails" fit with the idea), gives a way to work it out.
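For instance, here is a minimal sketch (in Python; the asymmetric discrete null and the observed value are made up purely for illustration) of that counting rule:

```python
import numpy as np

# Made-up asymmetric, unimodal discrete null distribution.
values = np.array([0, 1, 2, 3, 4, 5, 6])
probs  = np.array([0.05, 0.15, 0.35, 0.25, 0.12, 0.06, 0.02])

observed = 5

# Two-tailed p-value in the spirit of Fisher's exact test:
# total null probability of every outcome whose probability under
# the null is no greater than that of the observed outcome.
p_obs   = probs[values == observed][0]
p_value = probs[probs <= p_obs].sum()
print(p_value)   # P(0) + P(5) + P(6) = 0.05 + 0.06 + 0.02 = 0.13
```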

If you're familiar with HPD intervals (and again, we're dealing with unimodality), it's basically like taking everything outside an open HPD interval that's bounded in one tail by your sample statistic.

[Figure: an asymmetric unimodal null distribution, with the two-tailed p-value shaded as the total probability of all values whose likelihood under the null is no greater than that of the observed statistic.]

[To reiterate -- this is likelihood under the null we're equating here.]

So at least in the unimodal case, it seems simple enough to emulate Fisher's exact test and still talk about the two tails.

However, you may not have intended to invoke the spirit of Fisher's exact test in quite this way.

So, thinking outside that idea of what makes something 'as, or more extreme' for a moment, let's head just slightly more toward the Neyman-Pearson end of things. It can help (before you test!) to set about defining a rejection region for a test conducted at some generic level $\alpha$ (I don't mean you have to literally compute one, just consider how you would compute one). As soon as you do, the way to compute two-tailed p-values for your case should become obvious.

This approach can be valuable even if one is conducting a test outside the usual likelihood ratio test. For some applications, it can be tricky to figure out how to compute p-values in asymmetric permutation tests... but it often becomes substantially simpler if you think about a rejection rule first.
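As a rough sketch of what I mean (in Python; the two samples are simulated and purely illustrative): if the rejection rule is, say, "reject when the statistic falls in the lowest or highest $\alpha/2$ fraction of its permutation distribution", then the corresponding p-value is simply twice the smaller one-sided permutation proportion, capped at 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative samples (any skewed data would do).
x = rng.exponential(1.0, size=12)
y = rng.exponential(1.5, size=15)

t_obs = x.mean() - y.mean()

# Permutation distribution of the difference in means under the null of
# exchangeability; generally asymmetric for unequal sample sizes and skewed data.
pooled = np.concatenate([x, y])
n_x, B = len(x), 10000
t_perm = np.empty(B)
for b in range(B):
    perm = rng.permutation(pooled)
    t_perm[b] = perm[:n_x].mean() - perm[n_x:].mean()

# The equal-tail rejection rule "reject if T lies in the lowest or highest
# alpha/2 fraction of the permutation distribution" leads to this p-value:
p_lower = np.mean(t_perm <= t_obs)
p_upper = np.mean(t_perm >= t_obs)
p_two_sided = min(1.0, 2 * min(p_lower, p_upper))
print(p_two_sided)
```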

With F-tests of variance, I've noticed that the "double one tail p-value" can give quite different p-values from what I see as the right approach. [I insist, for example, that it shouldn't matter which group you call "sample 1", or whether you put the larger or the smaller variance in the numerator - yet with some common approaches these apparently reasonable conditions are violated.]
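To make that concrete, here is a rough sketch (in Python/scipy; the degrees of freedom and observed variance ratio are arbitrary) comparing the doubled one-tailed p-value with the equal-density ('lower likelihood') p-value from the first part of this answer, for an F statistic; the skewness of the F distribution makes the two recipes disagree noticeably.

```python
import numpy as np
from scipy import stats, optimize

df1, df2 = 10, 6          # arbitrary degrees of freedom
f_obs = 3.2               # arbitrary observed variance ratio
F = stats.f(df1, df2)

# (a) Double the smaller one-tailed p-value.
p_double = 2 * min(F.cdf(f_obs), F.sf(f_obs))

# (b) Equal-density ("lower likelihood") p-value: find the point in the
# other tail with the same density, then add the two tail probabilities.
mode = (df1 - 2) / df1 * df2 / (df2 + 2)   # mode of the F density (needs df1 > 2)
f_low = optimize.brentq(lambda x: F.pdf(x) - F.pdf(f_obs), 1e-12, mode)
p_density = F.cdf(f_low) + F.sf(f_obs)

print(p_double, p_density)   # the two recipes give noticeably different values
```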

[1]: Agresti, A. (1992), "A Survey of Exact Inference for Contingency Tables", Statistical Science, Vol. 7, No. 1 (Feb.), pp. 131-153.

Glen_b
  • "As soon as you do, the way to compute two tailed p-values for your case should become obvious" -- I am actually not sure what your mean; could you perhaps clarify? In general, why can't your "Fisherian" p-values, as you defined them on the figure above, be used for Neyman-Pearson testing in exactly the same way? – amoeba Mar 04 '15 at 10:18
  • @amoeba Informally, because N-P deals with both the null and the alternative while Fisher's approach considers only the null. If we're considering the alternative (but not restricting ourselves to likelihood ratio tests), then the alternative tells us what regions of the test statistic are 'more extreme'. When we choose a rejection rule which has probability of rejection $\alpha$ under a simple null, we figure out then how to allocate probability to the two tails; this defines how p-values work. ... ctd – Glen_b Mar 04 '15 at 11:40
• ctd... If we're doing a likelihood ratio test, the likelihood ratio is always one-tailed, but if we construct an equivalent two-tailed test based on some statistic then we still look to smaller likelihood ratios to locate "more extreme". – Glen_b Mar 04 '15 at 11:45
• @Glen_b So if I understand correctly, under unimodality "more extreme values" should have a probabilistic meaning: "values with lower probability", irrespective of whether the implied non-rejection interval is symmetric around the null or not... Whereas what I thought of was to create a symmetric interval around the null based on the obtained estimate and then add the probabilities outside it and on the boundary (CONTD) ... – Alecos Papadopoulos Mar 04 '15 at 11:57
• CONTD ... essentially my idea interprets "more extreme" in terms of the values themselves (more extreme in absolute terms), and not in terms of the associated probabilities, and would amount to considering the distribution of the _absolute value_ of the estimator and calculating a one-tail p-value. Am I understanding your answer correctly? – Alecos Papadopoulos Mar 04 '15 at 11:58
  • @Glen_b, thanks. Does it actually mean that under the N-P framework when we have a clearly defined alternative, the p-values are always one-sided? "More extreme" should be then interpreted as "further from the null AND closer to the alternative", as opposed to simply "further from the null"? – amoeba Mar 04 '15 at 12:37
• Alecos, what @Glen_b described is definitely closer to my intuitive understanding of the word "extreme" than what you are describing in the last two comments. Imagine a unimodal but super-skewed distribution of the test statistic, so that $-0.1$ is a very very uncommon value under the null, but $+0.1$ is still very common. If you obtain the value $-0.1$ in some dataset, this should certainly cast doubt on the null, but under your interpretation you would look at $|-0.1|=0.1$, which is not suspicious at all. By taking the absolute value you destroy the sensitivity! – amoeba Mar 04 '15 at 12:41
• @amoeba Certainly, sometimes you have to write it in order to see it. Indeed, we are in the land of probabilities, so "extreme in terms of probability" appears to me, too, to be the right approach, rather than "extreme in terms of mathematical value". And only the asymmetric setup helps to clarify that; in the symmetric case the distinction is lost, at least for the "no-effect" null hypothesis. One more case textbooks do not prepare you for... – Alecos Papadopoulos Mar 04 '15 at 12:51
• Doubling the one-tailed p-value might be defended as a Bonferroni correction for carrying out two one-tailed tests. After all, following a two-tailed test, we're usually very much inclined to regard any doubt cast on the truth of the null as favouring another hypothesis whose direction is determined by the data. – Scortchi - Reinstate Monica Mar 04 '15 at 13:20
• @Alecos it's simple enough to justify a symmetric choice! I find it hard to see how you'd read what I wrote as suggesting a symmetric choice was in any way not a valid thing to do (that choice is covered by the discussion I gave about the rejection rule - you can easily construct a symmetric rejection rule). The first part of my answer was responding to the part in the question about Fisher. If you ask about Fisher, should I not discuss what it seems Fisher might do, based on what he did in similar circumstances? You seem to interpret my response as saying more than it is. – Glen_b Mar 04 '15 at 14:42
  • @amoeba A *likelihood ratio test* is one-tailed in that you only reject when the likelihood ratio is small. I'm sure I said this already. – Glen_b Mar 04 '15 at 14:46
• @Alecos In particular, I am not advocating Fisher's or Neyman-Pearson approaches (whether we are talking about likelihood ratio tests or just hypothesis tests more generally), nor should you consider me as trying to suggest that anything that I have omitted might be wrong. I'm just discussing a number of the things you seemed to be raising in your question. – Glen_b Mar 04 '15 at 14:51
  • @Glen_b: Rather tangential - but as $F(t;\nu_1,\nu_2)=1-F(\frac{1}{t};\nu_2,\nu_1)$, where $F(\cdot)$ is the F-distribution cdf, isn't it the "double one tail p-value" that is invariant to whether you put the larger or smaller variance in the numerator? Whereas working out the probability of all values with lower likelihood than the observed test statistic isn't? Or am I misunderstanding what you wrote? – Scortchi - Reinstate Monica Mar 04 '15 at 16:02
• @Glen_b Your answer helped me realize the distinction between "more extreme in value" (i.e. of greater value in absolute terms) and "more extreme in probability" (i.e. of lower probability), and this is what I tried to convey in my comment, under the Fisherian approach. Of course, if we start from the determination of the non-rejection region, as you discuss in the second part of your answer, we can make it symmetric around zero; that's clear, since we _decide_ what the non-rejection region will be. – Alecos Papadopoulos Mar 04 '15 at 17:14
• Ultimately, yes. The neat thing about Fisher's approach is it gives a very sensible way of arriving at a p-value without even having an alternative. But if you do have specific alternatives of interest, you can target your rejection region more or less precisely to those alternatives by declaring the parts of the sample space where the alternatives will tend to put your samples as the rejection region. A test statistic, T, is a convenient way of achieving that, in essence by associating a single number with each point in the sample space (giving us a 'more extreme' as measured by T). ... ctd – Glen_b Mar 04 '15 at 23:22
• (ctd) ... Then your rejection rule uses the test statistic to "order" the sample space (at least, it establishes a partial order), and that gives a way to find a p-value since 'as or more extreme' is explicitly identified by the rejection rule. – Glen_b Mar 04 '15 at 23:24

A p-value's well-defined once you create a test statistic that partitions the sample space & orders the partitions according to your notions of increasing discrepancy with the null hypothesis. (Or, equivalently, once you create a set of nested rejection regions of decreasing size.) So what R. & S. are getting at is that if you consider either high or low values of a statistic $S$ to be interestingly discrepant with your null hypothesis you still have a little work to do to get a proper test statistic $T$ from it. When $S$ has a symmetric distribution around nought they seem to leap to $T=|S|$ without much thought, & therefore regard the asymmetric case as presenting a puzzle.

Doubling the lowest one-tailed p-value can be seen as a multiple-comparisons correction for carrying out two one-tailed tests. After all, following a two-tailed test, we're usually very much inclined to regard any doubt cast on the truth of the null as favouring another hypothesis whose direction is determined by the observed data. A proper test statistic is then $t=\min(\Pr_{H_0}(S<s),\Pr_{H_0}(S>s))$, & when $S$ has a continuous distribution the p-value is given by $2t$.†

When $S$ has a continuous distribution, the approach to forming a two-tailed test shown by @Glen_b—defining the density of $S$ under the null as the test statistic, $T=f_S(S)$—will of course produce valid p-values; but I'm not sure that it was ever recommended by Fisher, or that it's currently recommended by neo-Fisherians. If at first glance it appears more principled somehow than doubling the one-tailed p-value, note that having to deal with probability density rather than mass means that the two-tailed p-value thus calculated may change when the statistic is transformed by an order-preserving function.

For example, if to test the null that a Gaussian mean is equal to nought, you take a single observation $X$ & obtain $1.66$, the value with equal density in the other tail is $-1.66$, & the p-value is therefore $$p=\Pr(X > 1.66) +\Pr(X<-1.66)=0.048457+0.048457=0.09691.$$ But if you consider it as testing the null that a log-Gaussian geometric mean is equal to one & take a single observation $Y$ & obtain $\mathrm{e}^{1.66}=5.2593$, the value with equal density in the other tail is $0.025732$ ($=\mathrm{e}^{-3.66}$), & the p-value is therefore $$p=\Pr(Y>5.2593) +\Pr(Y<0.025732)=0.048457+0.00012611=0.04858.$$

[Figure: the standard Gaussian density of $X$ and the corresponding log-Gaussian density of $Y=\mathrm{e}^X$, with the observed values ($1.66$ and $5.2593$) and the equal-density points in the opposite tails ($-1.66$ and $\mathrm{e}^{-3.66}$) marked.]

Note that cumulative distribution functions are invariant to order-preserving transformations, so in the example above doubling the lowest p-value gives \begin{align}p=2t&=2\min(\Pr(X<1.66),\Pr(X>1.66))\\&=2\min(\Pr(Y<5.2593),\Pr(Y>5.2593))\\&=2\min(0.048457,0.951543)\\&=2\times 0.048457=0.09691.\end{align}
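For anyone who wants to check the arithmetic, here is a small sketch (in Python/scipy) reproducing both calculations: the equal-density p-values on the two scales, which differ, and the doubled one-tailed p-value, which is unchanged by the transformation.

```python
import numpy as np
from scipy import stats, optimize

x_obs = 1.66
X = stats.norm()                 # X ~ N(0, 1) under the null

# Equal-density two-tailed p-value on the X scale (symmetric case).
p_x = X.sf(x_obs) + X.cdf(-x_obs)                     # ~ 0.0969

# Same test expressed on the Y = exp(X) scale: Y is log-Gaussian.
Y = stats.lognorm(s=1)
y_obs = np.exp(x_obs)
# Point in the lower tail with the same density as y_obs
# (the mode of this log-Gaussian density is exp(-1)).
y_low = optimize.brentq(lambda y: Y.pdf(y) - Y.pdf(y_obs), 1e-12, np.exp(-1))
p_y = Y.sf(y_obs) + Y.cdf(y_low)                      # ~ 0.0486, y_low ~ exp(-3.66)

# Doubling the smaller one-tailed p-value is invariant to the transformation.
p_double_x = 2 * min(X.cdf(x_obs), X.sf(x_obs))       # ~ 0.0969
p_double_y = 2 * min(Y.cdf(y_obs), Y.sf(y_obs))       # ~ 0.0969 (identical)

print(p_x, p_y, p_double_x, p_double_y)
```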

A kind of sequel to this answer, discussing some principles of test construction in which the alternative hypothesis is explicitly stated, can be found here.

† When $S$ has a discrete distribution, writing

$$p_\mathrm{L} = \Pr_{H_0}(S\leq s), \qquad p_\mathrm{U} = \Pr_{H_0}(S\geq s)$$

for the lower & upper one-tailed p-values, the two-tailed p-value is given by

$$ \Pr(T\leq t) = \begin{cases} p_\mathrm{L} + \Pr_{H_0}(P_\mathrm{U} \leq p_\mathrm{L}) & \text{when}\ p_\mathrm{L} \leq p_\mathrm{U}\\ p_\mathrm{U} + \Pr_{H_0}(P_\mathrm{L} \leq p_\mathrm{U}) & \text{otherwise} \end{cases} $$

That is, we add to the smaller one-tailed p-value the largest achievable p-value in the other tail that does not exceed it. Note that $2t$ is still an upper bound.
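Here is a small sketch of this rule (in Python/scipy; the skewed binomial null and the observed count are arbitrary, chosen purely for illustration), alongside the doubled one-tailed p-value as an upper bound:

```python
import numpy as np
from scipy import stats

n, prob = 20, 0.3                  # arbitrary skewed binomial null
s_obs = 11                         # arbitrary observed count
support = np.arange(n + 1)
S = stats.binom(n, prob)

p_L = S.cdf(s_obs)                 # Pr(S <= s_obs)
p_U = S.sf(s_obs - 1)              # Pr(S >= s_obs), here ~ 0.0171

# Achievable one-tailed p-values in each tail over the whole support.
upper_tail = S.sf(support - 1)     # Pr(S >= k) for each k
lower_tail = S.cdf(support)        # Pr(S <= k) for each k

if p_L <= p_U:
    # Add the largest achievable upper-tail p-value not exceeding p_L.
    other = upper_tail[upper_tail <= p_L].max() if np.any(upper_tail <= p_L) else 0.0
    p_two = p_L + other
else:
    # Add the largest achievable lower-tail p-value not exceeding p_U (~ 0.0076 here).
    other = lower_tail[lower_tail <= p_U].max() if np.any(lower_tail <= p_U) else 0.0
    p_two = p_U + other

print(p_two, 2 * min(p_L, p_U))    # ~ 0.0248 vs ~ 0.0343: doubling is an upper bound
```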

Scortchi - Reinstate Monica
• Oh wow. This is a very good point, +1. What is your advice then? Also, can I interpret this discrepancy as corresponding to different (in this case implicit) choices of test statistic? – amoeba Mar 05 '15 at 18:05
  • @Scortchi Thanks. Although I am dealing with a discrete distribution with bounded support, it is certainly useful to discuss the continuous case. – Alecos Papadopoulos Mar 05 '15 at 23:21
  • @amoeba (1) Well, generally I'd suggest doubling the lowest one-sided p-value; for inference about a model parameter the (generalized) likelihood ratio test is also available. Neither of these approaches "care" about order-preserving transformations. (2) The choice of test statistic is between the different densities $f_X(x)$ vs $f_Y(y)= f_X(\log y)\frac{\mathrm{d}x}{\mathrm{d}y}$. – Scortchi - Reinstate Monica Mar 06 '15 at 12:47
• In your formula with $\min$ one occurrence of $s$ should probably read $-s$? Apart from that, I am not sure I understand the logic behind taking the minimum of the two p-values. In your last example this would mean doubling p=0.0001 and not p=0.048, even though the actually observed value was 1.66 and not -3.66. This looks weird. – amoeba Mar 06 '15 at 16:25
• @amoeba: Not a typo! And when you observe 1.66 you take the minimum of 0.952 & 0.048. If you actually observed -3.66 it'd be the minimum of 0.0001 & 0.9999. – Scortchi - Reinstate Monica Mar 06 '15 at 16:40
• Ah! I see. I was confused, because the previous discussion between Glen_b and Alecos somehow set me on a wrong track to understand what you mean. Perhaps you can supplement your explicit example with how you suggest computing the p-value in this particular case. – amoeba Mar 06 '15 at 16:42
• @Scortchi I have just accepted Glen_b's answer because it was more "useful" to me in the narrow sense. But yours helped me to _avoid_ the trap of thinking that "that's all there is to it", which is an excellent insurance policy for future risks. Thanks again. – Alecos Papadopoulos Mar 20 '15 at 18:39
  • @AlecosPapadopoulos: That's fine, of course, & you're welcome. (As both Glen_b & I have taken pains to point out, there's not a uniquely correct approach.) – Scortchi - Reinstate Monica Mar 21 '15 at 13:28
• @Scortchi I have to agree; my response took a rather simplistic and one-sided view, and I should qualify, extend and justify the answer. I'll probably do that in several stages. – Glen_b Mar 22 '15 at 06:06
• @Glen_b: Thanks, I look forward to it. I also want to extend mine to show how score tests & generalized likelihood ratio tests give different answers (in general); & the theory of unbiased tests is surely worth mentioning in this context (but I can barely remember it). – Scortchi - Reinstate Monica Mar 22 '15 at 15:37
• How can I cite your answer? Or what is the reference on which you rely? Citing is doable on [MS](https://math.stackexchange.com/users/253096/kanak), but not on CV. – keepAlive Sep 04 '17 at 12:38
• @Kanak: Copy the MS format I suppose. The 1st paragraph could be from any text on theoretical Statistics; the 2nd I recall from Cox & Hinkley, *Theoretical Statistics*, though doubtless it's not only to be found there. The rest was worked out in response to a specific example - again the ideas of invariance/equivariance are discussed in many texts. – Scortchi - Reinstate Monica Sep 04 '17 at 12:53
  • A question about your remark "...having to deal with probability density rather than mass...": I think what you meant here is that for discrete null distributions, the p-value defined via summing over outcomes that are less likely under null will be invariant under transformations and so this particular critique of this approach does not apply. Right? However, for discrete nulls doubling one-sided p-value will still in general give a different result. Would you still advocate doubling one-sided p-value? My question is provoked by this thread: https://stats.stackexchange.com/questions/284641. – amoeba May 08 '18 at 14:55
  • @amoeba: (1) Right! Still debatable in any particular case whether less probable outcomes are sensibly viewed as more extreme. (2) No! The same "two one-tailed tests" argument would lead to, rather than doubling the one-tailed p-value, adding the largest achievable smaller p-value in the other tail. (3) The q. you link to doesn't forbid consideration of alternative hypotheses, unlike this one, so there are a few other tests in the running. – Scortchi - Reinstate Monica May 08 '18 at 16:29
  • Thanks. Point (2) is quite interesting. I suggest you add some discussion of the discrete case to this answer (if time allows). Would be very useful and relevant for this thread I think. – amoeba May 08 '18 at 18:31
• @Glen_b By the way, this is a reminder that you were going to "qualify, extend and justify [your] answer" :-) perhaps you would still like to do it at some point. Personally, I would certainly appreciate it; I find this thread very interesting and important. – amoeba May 08 '18 at 18:34