9

Testing a difference in hits (ones versus zeros) between two independent groups $X$ and $Y$ should be possible with a t-test, according to the following considerations:

  • $x_i\in\{0,1\}$ is the measurement for the $i$-th item in group $X$, and $y_i\in\{0,1\}$ the same for group $Y$
  • the proportion in each group is the mean value of the measurements, i.e. $\mu_X=\sum_i x_i/n$ and $\mu_Y=\sum_i y_i/n$
  • the difference of mean values $\mu_X$ and $\mu_Y$ in two groups can be tested with a t-test

In this particular case, a proportion test (function prop.test in R) is an alternative test option. Interestingly, the results are quite different:

> x <- c(rep(1, 10), rep(0, 90))
> y <- c(rep(1, 20), rep(0, 80))
> t.test(x,y,paired=FALSE)
t = -1.99, df = 183.61, p-value = 0.04808
> prop.test(c(10,20), c(100,100))
X-squared = 3.1765, df = 1, p-value = 0.07471

Note the higher p-value of prop.test. Does this mean that the t-test has higher power, i.e., that it can distinguish between $H_0$ and its alternative already at smaller $n$? Is there a reason why a t-test should not be used in this case?

Addition (Edit: resolved in a comment under the answer by Thomas Lumley below): The result of the t-test is even more surprising in light of the observation that even the asymptotic ("Wald") 95% confidence intervals of the two proportions overlap (0.1587989 > 0.1216014):

> library(binom)
> binom.confint(10, 100, method="asymptotic")
      method  x   n mean      lower     upper
1 asymptotic 10 100  0.1 0.04120108 0.1587989
> binom.confint(20, 100, method="asymptotic")
      method  x   n mean     lower     upper
1 asymptotic 20 100  0.2 0.1216014 0.2783986

As confidence intervals based on the t-distribution should be even wider than those based on the normal distribution (i.e. $z_{1-\alpha/2}$), I do not understand why the t-test reports a significant difference at the 5% level.

cdalitz
  • Most of the difference is the continuity correction which was applied in the chi-squared but not in the t-test. The remainder of the difference is the same as the difference between z-test and t-test. If you use `correct=FALSE` in the chi-squared you will see the p-value is fairly close to that of the t-test, or if you compute the continuity-corrected t-test and compare with the above chi-squared, you'll see the two fairly close again. – Glen_b Jul 23 '21 at 13:44
  • @glen-b Ah yes, this makes the difference. And the continuity correction makes the test overly conservative. I have found simulations by D'Agostino et al. (Am. Stat. 42, pp. 199-201, 1988) that show that the t test or the *uncorrected* chi square test (`correct=FALSE`) has an actual $\alpha$ probability much closer to the nominal level than the corrected chi square test. The t test is thus perfectly ok in this case, and my question is answered. – cdalitz Jul 25 '21 at 09:19

3 Answers

9
  1. A difference between p-values of 0.048 and 0.074 is not large. This can easily happen between tests that do similar, but not identical, things.

  2. The theory of the t-test is for normally distributed data, which your data obviously are not. You're right that the t-test can be justified as an approximation, but there's no reason to use an approximation when a more precise test (namely the proportion test) is available. There is certainly no reason to expect the t-test to have better power, except in case it is anticonservative, which is not a good thing (being an approximation, one would probably need to simulate its finite sample characteristics in this situation).

  3. Edited after looking up the reference D'Agostino et al. ("The Appropriateness of Some Common Procedures for Testing the Equality of Two Independent Binomial Populations", Am. Stat. 1988) given by cdalitz. This reference states that prop.test with continuity correction is too conservative, whereas the t-test as well as prop.test without continuity correction are normally closer to the nominal level, if occasionally anticonservative (which in my view does not necessarily justify an overall recommendation). This was also mentioned in the answer by Thomas Lumley.

    Ignoring the continuity correction for a moment, there are two differences between the t-test and prop.test (the latter is not fully documented, but I think it performs the z-test based on the normal approximation).

    (a) prop.test uses the knowledge that the variance of the Binomial is $np(1-p)$ rather than using a sample variance based on normality. In my view what prop.test does here should clearly do better, as it is based on information about the specific setup used here.

    (b) prop.test uses a normal approximation whereas the t-test uses a t-approximation. Now both of these, applied to the Binomial situation, are asymptotic in nature (the t-distribution is only exact if the underlying data are normal, which they are not here), and they are in fact asymptotically equivalent. Although the normal approximation looks more intuitive based on the Central Limit Theorem, this doesn't imply by any means that the normal works better than the t in the finite sample situation (and the t is just as well justified by the CLT, if only indirectly). The t-distribution is motivated by the normality assumption, but it may also be the case that the asymptotic normal distribution of prop.test underestimates the finite sample variability because it ignores the variability in the variance estimation, and the t-distribution, despite not being precisely justified here, may do a better job at that.

    So I now believe that potentially (as could be confirmed by simulations; a minimal sketch follows below) the best thing to do could be to use the test statistic of prop.test, i.e., the "correct" variance estimation, but to replace the asymptotic normal distribution by a t-distribution, which in some sense may combine the advantages of both.
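A minimal sketch of this idea (my own, unvalidated code, not an established procedure; the choice of $n_1+n_2-2$ degrees of freedom is an ad-hoc assumption, and the null scenario $p_1=p_2=0.15$, $n=100$ is arbitrary):

## Score statistic of the uncorrected prop.test (pooled binomial variance),
## but referred to a t-distribution instead of the normal.
prop.t.test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  pp <- (x1 + x2) / (n1 + n2)                  # pooled proportion under H0
  se <- sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
  tstat <- (p1 - p2) / se
  2 * pt(-abs(tstat), df = n1 + n2 - 2)        # two-sided p-value
}

prop.t.test(10, 100, 20, 100)   # close to prop.test(..., correct=FALSE)

## Toy check of the actual type I error rate under H0: p1 = p2 = 0.15
set.seed(1)
pvals <- replicate(10000, {
  x1 <- rbinom(1, 100, 0.15)
  x2 <- rbinom(1, 100, 0.15)
  prop.t.test(x1, 100, x2, 100)
})
mean(pvals < 0.05)              # should be close to the nominal 0.05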

Christian Hennig
  • According to the output of prop.test, it uses a "continuity correction" and the test statistic is $\chi^2$. These indicate that the proportion test also uses approximations. Which approximation is better might not be obvious, but as this is an old problem, I would guess that someone has already done a thorough analysis and simulations. – cdalitz Jul 23 '21 at 20:43
  • Fair enough, but the continuity issue is the same with the t-test, except there it's not "corrected". If somebody would use the t-test in a submission and I were reviewer, I wouldn't accept it unless the author would run the simulations themselves to convince the reader that the t-test is better. Until that happens I believe strongly that it isn't. – Christian Hennig Jul 23 '21 at 23:50
  • –1 "A difference between p-values of 0.048 and 0.074 is not large." This statement depends on the context, and is not always true. – Alexis Jul 24 '21 at 04:29
  • +1 @Alexis A difference between p=0.048 and p=0.074 is trivial if you are using p-values as indices of evidence against the null hypothesis within the statistical model chosen. The difference _may_ be interesting if you are using a cutoff of p=0.05 to distinguish between a 'significant' result and a 'not significant' result, but you probably shouldn't be doing that. See here for a start: https://stats.stackexchange.com/questions/16218/what-is-the-difference-between-testing-of-hypothesis-and-test-of-significance/16227#16227 – Michael Lew Jul 24 '21 at 07:06
  • @Alexis The difference may make a large difference to how people interpret it (but they probably shouldn't, see @MichaelLew), however it is not large in the sense that it is very hard to imagine any distribution (from H0, alternative or elsewhere) and any test for which one of 0.048 and 0.074 is realistic to expect and the other one is not. – Christian Hennig Jul 24 '21 at 09:51
  • "The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant" https://www.tandfonline.com/doi/abs/10.1198/000313006X152649 – Christian Hennig Jul 24 '21 at 09:53
  • @Lewian "however it is not large in the sense that it is very hard to imagine any distribution (from H0, alternative or elsewhere) and any test for which one of 0.048 and 0.074 is realistic to expect and the other one is not." This is false in epidemiology and population health. – Alexis Jul 24 '21 at 18:21
  • @Alexis My statement was maybe not totally clear. I mean something like this: Find any test and any distribution of interest for that test so that for given small $\epsilon$ the probability of observing a p-value in $0.048\pm\epsilon$ is substantially different from having the p-value in $0.074\pm\epsilon$. If you can give or cite a counterexample, I'd be very interested. (Generally the null hypothesis won't do, because under $H_0$ the p-value is uniformly distributed, so both these probabilities will be the same.) – Christian Hennig Jul 24 '21 at 23:44
  • @lewian "If somebody would use the t-test in a submission and I were reviewer, I wouldn't accept it" Although a reviewer has the power to do so, I would not reject an article because the authors use a different test method than I prefer. After searching more about this problem, I found a paper by D'Agostino et al. ("The Appropriateness of Some Common Procedures for Testing the Equality of Two Independent Binomial Populations", Am. Stat. 1988), who made these investigations and eventually recommended the t-test or the uncorrected chi squared test (`prop.test(..., correct=FALSE)`). – cdalitz Jul 25 '21 at 09:00
  • @cdalitz As a reviewer one can ask to do a certain thing differently without rejecting the paper. – Christian Hennig Jul 25 '21 at 12:02
  • @cdalitz Thanks by the way for the reference. Very interesting. I'm not quite sure though whether the results fully back up the recommendation; they seem quite ambiguous in my view, and power was not investigated. Anyway I have to admit that a case for the t-test can be made in one way or another. But surely using the knowledge of how the Binomial variance relates to the parameter, which is used by prop.test but not by the t-test, shouldn't hurt!? – Christian Hennig Jul 25 '21 at 13:07
  • @cdalitz Edited my answer. – Christian Hennig Jul 25 '21 at 13:29
  • "I don't know how to do paragraphs within an item of a numbered list" -- find a post with a numbered list that does it and look at the markdown for that post. There's an example here: https://stats.stackexchange.com/questions/485348/constant-information-scale-transformation/485392#485392 – Glen_b Jul 25 '21 at 16:05
8

The t-test can be quite robust to deviations from the normality assumption, particularly when sample sizes are large, so I understand why one might want to use a t-test for this task.

However, you know the parametric family; since the outcome is either $0$ or $1$, the distribution is completely characterized by the relative proportion, thus Bernoulli. Consequently, you can rely on a parametric test designed for a Bernoulli variable, which the t-test is not.

Methods that are robust to deviations from parametric assumptions are wonderful, since we typically do not know the type of population distribution. (If we did, why did we not determine the population parameters when we had the chance!?) However, the case of a binary variable is unique in how it is completely defined by the relative proportion and must be Bernoulli (or easy to represent as Bernoulli, such as calling “heads” and “tails” of a coin $0$ and $1$, respectively).
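For illustration (my addition, not something the answer commits to): a standard exact test that exploits the binomial structure directly is Fisher's exact test on the 2×2 table of hits and misses from the question:

tab <- matrix(c(10, 90,    # group X: 10 hits, 90 misses
                20, 80),   # group Y: 20 hits, 80 misses
              nrow = 2, byrow = TRUE)
fisher.test(tab)           # exact conditional test of equal proportions

Note that Fisher's test conditions on the table margins and is itself known to be conservative, so it complements rather than replaces the asymptotic options discussed in the other answers.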

Dave
  • I think for investigating the power one would need to sample from the alternative, but I advise against it because I'm pretty sure that the only way a better power can arise is bias/being anticonservative. (Edited after the earlier comment was deleted.) – Christian Hennig Jul 23 '21 at 12:49
7

You're correct that the tests should be similar. They are both tests of means, applied to a light-tailed distribution, so you should expect them to agree. What's more, the estimated variance $\hat p(1-\hat p)/n$ of a binomial proportion is extremely close to $s^2/n$:

> var(x)/100
[1] 0.0009090909
> .1*(.9)/100
[1] 9e-04
> .2*(.8)/100
[1] 0.0016
> var(y)/100
[1] 0.001616162

What you're seeing is the continuity correction. If you run prop.test without it, the $p$-values are almost identical:

> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = -1.99, df = 183.61, p-value = 0.04808
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1991454034 -0.0008545966
sample estimates:
mean of x mean of y 
      0.1       0.2 

> prop.test(c(10,20),c(100, 100),correct=FALSE)

    2-sample test for equality of proportions without continuity correction

data:  c(10, 20) out of c(100, 100)
X-squared = 3.9216, df = 1, p-value = 0.04767
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.197998199 -0.002001801
sample estimates:
prop 1 prop 2 
   0.1    0.2 

The continuity correction for the chi-squared test is a bit controversial. It does dramatically reduce the number of situations where the test is anti-conservative, but at the price of making the test noticeably conservative. Not using the 'correction' gives p-values that are closer to a uniform distribution under the null hypothesis. And, as you see here, not using the correction gives you something closer to the t-test.
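To see this for yourself, one can simulate the rejection rate at the 5% level under a null scenario (my sketch; $p_1=p_2=0.15$ and $n=100$ are arbitrary choices):

set.seed(42)
sim <- replicate(10000, {
  x <- rbinom(1, 100, 0.15)
  y <- rbinom(1, 100, 0.15)
  c(corrected   = prop.test(c(x, y), c(100, 100))$p.value,
    uncorrected = prop.test(c(x, y), c(100, 100), correct = FALSE)$p.value)
})
rowMeans(sim < 0.05)   # rejection rates under H0; the nominal level is 0.05

In runs of this kind the corrected version typically rejects noticeably less often than 5%, while the uncorrected one lands close to the nominal level, matching the behaviour described above.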

Thomas Lumley
  • Thanks, this explains the difference! What I still do not understand is why the t-test reports a significant ($\alpha=5\%$) difference although the classic 95% confidence intervals $\hat{p}\pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ (and, of course, also with $t$ instead of $z$) overlap. – cdalitz Jul 24 '21 at 10:59
  • Non-overlap of 95% confidence intervals is a much stronger criterion than a 95% confidence interval for the difference overlapping zero. With equal variance and sample size, a 95% CI for the difference overlaps zero if the difference is smaller than $1.96\sqrt{2}\sigma/\sqrt{n}$, and two intervals overlap if the difference is smaller than $2\times 1.96\sigma/\sqrt{n}$. Each interval overlapping the other point estimate is much closer to a 5% threshold. – Thomas Lumley Jul 25 '21 at 02:01
  • Ah yes, thanks! This follows from $Var(X-Y)=Var(X)+Var(Y)$ for independent $X$ and $Y$. The continuity correction does indeed seem to be controversial in this case: D'Agostino et al. ("The Appropriateness of Some Common Procedures for Testing the Equality of Two Independent Binomial Populations", Am. Stat. 1988) investigated its application in this particular use case of comparing proportions and came to the conclusion that the *uncorrected* chi squared test or the t test should be used. – cdalitz Jul 25 '21 at 09:05
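To make the overlap calculation concrete with the numbers from the question (my own check, using the per-group Wald standard errors and $z$ rather than $t$):

seX <- sqrt(0.1 * 0.9 / 100)   # Wald SE of the proportion in group X
seY <- sqrt(0.2 * 0.8 / 100)   # Wald SE of the proportion in group Y
1.96 * sqrt(seX^2 + seY^2)     # ~0.098: smallest difference the test calls significant
1.96 * (seX + seY)             # ~0.137: smallest difference with non-overlapping CIs

The observed difference of 0.1 lies between these two thresholds, which is exactly why the test is significant while the individual confidence intervals overlap.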