
A well-known rule of thumb is that, for an estimate to be significantly different from 0, its 95% CI must not include 0, i.e., the lower and upper bounds of the CI have to be either both positive or both negative.

However, I know that for certain measures of effect size the CI does not abide by this rule. For instance, for partial eta squared, the lower bound of its CI is by definition >= 0 (Kline, 2004). Thus, anyone who looks at the CI of the effect size and finds it does not contain 0 might mistakenly believe that the reported effect is significant, whereas this will not necessarily be the case.

So how are these two things reconciled, and where does my misunderstanding lie?


Reference: Kline, R. B. (2004). *Beyond significance testing*. Washington, DC: American Psychological Association.

z8080
I think the essence of the answer here is that some effect size statistics are directional, that is, they can be positive, negative, or 0, while other effect size statistics are non-directional, that is, always 0 or positive. Inference with the confidence intervals for these statistics is challenging. I wrote a long answer unpacking this, as I don't think the currently accepted answer really addresses the question as posed. – Sal Mangiafico Nov 21 '19 at 17:20

2 Answers


It depends on what kind of test you’re doing.

For a difference between values from two distributions (such as a difference in means), a nonzero difference indicates an effect. This covers the usual testing of mean differences with $H_0: \mu_0-\mu_1=0$.

However, when testing variances, it is more common to consider the ratio of the variances. In this case, the null would be $H_0: \sigma^2_0/\sigma^2_1 =1$, so what we want to know is whether the confidence interval for that ratio includes 1. Zero doesn't come into the mix: a ratio of zero would just mean that the numerator variance is zero.

So anyone seeing a confidence interval for a ratio of variances of $(0.8, 1.9)$ and calling the effect significant has made a mistake.
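
As a rough illustration, here is a minimal sketch in Python (simulated data and the standard F-based interval; nothing here comes from the original question) showing that the reference value for a variance ratio is 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0, 1.0, size=30)   # group 0
y = rng.normal(0, 1.3, size=25)   # group 1

ratio = np.var(x, ddof=1) / np.var(y, ddof=1)   # s0^2 / s1^2
df0, df1 = len(x) - 1, len(y) - 1
alpha = 0.05

# (s0^2/sigma0^2) / (s1^2/sigma1^2) ~ F(df0, df1), so invert the F quantiles
lower = ratio / stats.f.ppf(1 - alpha / 2, df0, df1)
upper = ratio / stats.f.ppf(alpha / 2, df0, df1)

print(f"95% CI for sigma0^2 / sigma1^2: ({lower:.2f}, {upper:.2f})")
# 'No effect' here is a ratio of 1, not 0: the difference in variances is
# significant at the 5% level only if this interval excludes 1.
```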

The theory behind this is that a confidence interval is the inversion of a hypothesis test: it partitions the space of possible parameter values into those that would and those that would not be rejected by an $\alpha$-level test.

Casella and Berger get into this when they talk about interval estimation. Their unfortunate terminology calls these partitions the “rejection region” and “acceptance region” (even though we do not quite “accept” a null hypothesis).

(I have seen on here that some exotic confidence intervals need not obey this rule, and I will allow someone else to address such details.)

All of this assumes a two-sample comparison. In the one-sample case, the thinking is the same: we check whether the confidence interval contains some surmised value, such as $\mu_0$ in $H_0: \mu=\mu_0$, where $\mu_0$ need not be zero.

(The two-sample case could use a null of something other than equality, such as $H_0: \mu_0-\mu_1=6$. Then the interesting question is whether the confidence interval contains 6.)
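
For completeness, here is the analogous sketch for a mean difference (again Python with simulated data and a Welch-type interval, purely illustrative): the same confidence interval is simply compared against whatever value the null specifies, whether 0 or 6.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 2, size=40)
b = rng.normal(4, 2, size=35)

diff = a.mean() - b.mean()
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
se = np.sqrt(va + vb)
# Welch-Satterthwaite degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"95% CI for mu_a - mu_b: ({ci[0]:.2f}, {ci[1]:.2f})")
# Against H0: mu_a - mu_b = 0, ask whether the interval excludes 0;
# against H0: mu_a - mu_b = 6, ask whether it excludes 6.
```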

Dave
    1/3 "I have seen on here that some exotic confidence intervals need not obey this rule, and I will allow someone else to address such details." Relevant: Browne, R. H. (1979). [On visual assessment of the significance of a mean difference](https://www.jstor.org/stable/2530259). *Biometrics*, 35(3), 657–665. – Alexis Nov 21 '19 at 17:23
  • 2
    2/3 Cumming, G. (2009). Inference by eye: Reading the overlap of independent confidence intervals. *Statistics In Medicine*, 28(2), 205–220. – Alexis Nov 21 '19 at 17:24
  • 2
    3/3 Smith, R. W. (1997). Visual hypothesis testing with confidence intervals. Proceedings of the Twenty-Second Annual SAS® Users Group International Conference. – Alexis Nov 21 '19 at 17:24
  • 2
    1/2 Bonus for equivalence tests: Tryon, W. W. (2001). [Evaluating Statistical Difference, Equivalence, and Indeterminancy Using Inferential Confidence Intervals: An Integrated Alternative Method of Conducting Null Hypothesis Statistical Tests](https://pdfs.semanticscholar.org/fe40/f86583bdf82fee415c653882378684da18dd.pdf). *Psychological Methods*, 6(4), 371–386. – Alexis Nov 21 '19 at 17:26
  • 2
    2/2/ Tryon, W. W., & Lewis, C. (2008). [An Inferential Confidence Interval Method of Establishing Statistical Equivalence That Corrects Tryon’s (2001) Reduction Factor](https://www.researchgate.net/profile/Warren_Tryon/publication/23244009_An_Inferential_Confidence_Interval_Method_of_Establishing_Statistical_Equivalence_That_Corrects_Tryon's_2001_Reduction_Factor/links/54401bad0cf2be1758cffaae.pdf). *Psychological Methods*, 13(3), 272–277. – Alexis Nov 21 '19 at 17:26
1

One salient point in this question is that it refers to effect size statistics. The fact that these statistics can be either directional or non-directional is, I think, the essence of the answer to the question.

Some effect size statistics are directional †. That is, they can take positive, negative, or zero values, with a positive value indicating a positive correlation or, usually, that the values in the first group tend to be larger than those in the second. These include r, Spearman's rho, Kendall's tau, phi for 2 x 2 contingency tables, Cohen's d for the means of two groups, and Cliff's delta for the ranks of two groups.

In these cases, it's a fair interpretation that if the confidence interval does not include zero, then the effect size statistic is "significant". One exception: if the statistic is reported as an absolute value ‡ (that is, always positive), bootstrap confidence intervals may not reflect this property. As a final note, since there are different ways to compute confidence intervals (e.g. by formula or by bootstrap), this conclusion may not always match the related hypothesis test exactly. (That is, say, a 95% CI for Cohen's d may not match exactly the results of a t test.)
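
To illustrate the directional case, here is a small sketch (Python, percentile bootstrap on simulated data; the cohens_d helper is written ad hoc for this example rather than taken from any package):

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(0.5, 1, size=40)
b = rng.normal(0.0, 1, size=40)

def cohens_d(x, y):
    """Cohen's d using a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Percentile bootstrap: resample each group with replacement and recompute d
boot = [cohens_d(rng.choice(a, size=len(a), replace=True),
                 rng.choice(b, size=len(b), replace=True))
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"d = {cohens_d(a, b):.2f}, 95% bootstrap CI: ({lo:.2f}, {hi:.2f})")
# d is directional, so "does the interval exclude 0?" is a sensible question,
# though the answer need not agree exactly with the p-value from a t test.
```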

There are other effect size statistics that are always zero or positive. Examples include r-squared, eta-squared, partial eta-squared, Cramer's V, used for contingency tables larger than 2 x 2, and the epsilon-squared sometimes used with Kruskal-Wallis. Often these are used in situations where it is not easy to convey the effect size in a negative-zero-positive framework. For example, while we can use phi in a 2 x 2 contingency table, where it makes sense to regard the correlation as positive or negative, in larger tables there may be no clear way to think of the association as positive or negative.

For these statistics, because they are always greater than or equal to zero, an accurate confidence interval will never include values less than zero. Some calculations may produce a lower bound of exactly zero, which could be read as indicating a "statistically zero effect", but calculations are rarely that precise. With bootstrap confidence intervals, a percentile interval is unlikely to contain zero, whereas a normal-approximation bootstrap interval can cross zero; if that latter method is appropriate for the specific statistic of concern, it may be a viable route for inference.
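
To make the contrast concrete, here is a sketch (Python, simulated one-way data with essentially no true effect; eta_squared is defined ad hoc for the example) comparing a percentile bootstrap interval, which cannot leave the [0, 1] range of eta-squared, with a normal-approximation bootstrap interval, which can dip below zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(0, 1, size=20) for _ in range(3)]   # three groups, no real effect

def eta_squared(groups):
    """Eta-squared = SS_between / SS_total for a one-way layout."""
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_total = sum(((g - grand) ** 2).sum() for g in groups)
    return ss_between / ss_total

# Bootstrap by resampling within each group
boot = [eta_squared([rng.choice(g, size=len(g), replace=True) for g in groups])
        for _ in range(5000)]
est = eta_squared(groups)

pct_lo, pct_hi = np.percentile(boot, [2.5, 97.5])           # never below 0
z = stats.norm.ppf(0.975)
norm_lo, norm_hi = est - z * np.std(boot, ddof=1), est + z * np.std(boot, ddof=1)

print(f"eta^2 = {est:.3f}")
print(f"percentile bootstrap CI: ({pct_lo:.3f}, {pct_hi:.3f})")
print(f"normal bootstrap CI:     ({norm_lo:.3f}, {norm_hi:.3f})")
# With a small effect, the normal-approximation lower bound can fall below 0,
# even though eta-squared itself can never be negative.
```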

One final note: there's the interesting case of Vargha and Delaney's A, which is directional, but for which 0.50, rather than 0, indicates no effect.
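
A quick sketch of that last case (Python, simulated data, with A computed directly from its definition as the proportion of pairwise comparisons in which the first group's value is larger, counting ties as one half):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0.3, 1, size=30)
y = rng.normal(0.0, 1, size=30)

def vargha_delaney_a(x, y):
    """A = [#(x > y) + 0.5 * #(x = y)] / (n_x * n_y) over all pairs."""
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (greater + 0.5 * ties) / (len(x) * len(y))

boot = [vargha_delaney_a(rng.choice(x, size=len(x), replace=True),
                         rng.choice(y, size=len(y), replace=True))
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"A = {vargha_delaney_a(x, y):.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
# A lies in [0, 1] and 0.50 means "no effect", so the interval is judged
# against 0.50 rather than 0.
```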


† I have seen this term used this way for effect size statistics, but I'm not sure its meaning is universal in this context.

‡ For example, I've seen R functions that report the absolute value of certain effect size statistics.

Sal Mangiafico