
I am trying to figure out what test I should use in the following scenario: I know that there is a lot of room for improvement in a specific area at work. Being extremely critical, let's say that in a sample of $52$ observations, $31$ could be improved. After running an improvement / QA program for six months, suppose that in a sample of $55$ cases there are only $11$ with residual flaws. The two samples are independent. We are therefore comparing two proportions: $p_{\text{initial}} = \frac{31}{52}$ and $p_{\text{final}} = \frac{11}{55}$.

Although the numbers are exaggerated, I still want to see whether the two proportions are statistically significantly different, and I think I have a couple of options: I can run an exact binomial test to calculate the probability that the new proportion of flawed observations, $\frac{11}{55}$, would occur if the actual underlying probability had remained $\frac{31}{52}$. Alternatively, I can run a chi-squared test.

The chi-squared test is an approximation, and what I have read is that it is to be applied when the total number of observations is too high. That is clearly not the case in this example; however, playing with the numbers in R, I saw no delay or problems with the exact test even with counts $>10{,}000$, and there was no indication of any normal approximation being used.
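
For instance, here is the kind of quick check I ran (the scaled-up counts are purely illustrative):

    # Scale the observed counts up by a factor of 200 and time the exact test;
    # it still returns essentially instantly, with no sign of an approximation.
    system.time(
        binom.test(c(11 * 200, (55 - 11) * 200), p = 31/52, alternative = "less")
    )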

So, if this is all true, why shouldn't we always opt for an exact binomial test rather than a chi-squared test?

The code in R for the two tests would be:

    # Exact binomial test:
    binom.test(c(11, 55 - 11), p = 31/52, alternative = "less")

    # Chi-squared test:
    prop.test(c(31, 11), c(52, 55), correct = FALSE, alternative = "greater")
Antoni Parellada
  • The application of `binom.test` seems inappropriate here. You need to compare two datasets, not one dataset to a fixed probability. Setting $p=31/52$ ignores the uncertainty in the estimated value of $31/52$ for the pre-intervention rate and thereby (substantially) increases the false positive error rate. – whuber Jan 30 '15 at 17:14
  • The relevant exact tests for comparing two estimated proportions are Fisher's & Barnard's: see [On Fisher's exact test: What test would have been appropriate if the lady hadn't known the number of milk-first cups?](http://stats.stackexchange.com/q/136584/17230). – Scortchi - Reinstate Monica Jan 13 '17 at 14:13
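
For reference, here is a minimal sketch of the Fisher's exact test mentioned in the last comment, applied to the question's 2×2 table:

    # Fisher's exact test on the before/after 2x2 table (flawed, not flawed);
    # unlike the one-sample binomial test, it treats both proportions as estimated.
    tab <- rbind(before = c(31, 52 - 31), after = c(11, 55 - 11))
    fisher.test(tab, alternative = "greater")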

2 Answers


You state that you have read the chi-squared test should be used when "the total number of observations is too high". I have never heard this, and I don't believe it is true, although it is hard to say, since "too high" is quite vague. There is a standard recommendation not to use the chi-squared test when any cell has an expected count less than 5. This traditional warning is now known to be too conservative: having an expected count less than 5 in a cell is not really a problem. Nonetheless, perhaps what you read is somehow related to that warning.
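
As a quick check (a sketch using the counts from the question), the expected cell counts here are easy to inspect:

    # Expected counts under independence for the question's 2x2 table;
    # all four are well above 5, so even the traditional rule would not object.
    tab <- rbind(before = c(31, 52 - 31), after = c(11, 55 - 11))
    chisq.test(tab, correct = FALSE)$expected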

As @whuber notes, the two tests you ask about make different assumptions about your data. The exact binomial test assumes that the baseline probability ($31/52$) is known a priori and without error. The chi-squared test estimates the proportions for both before and after, and treats both as having uncertainty due to sampling error.

Thus, the chi-squared test will have less power, but it is probably more honest. It may well be that the true proportion of flawed observations was considerably lower than $31/52$ and only looked that bad by chance. You certainly may test whether the after proportion is less than $31/52$, just as you may test it against any fixed value. But a significant result would not necessarily imply that the process improved following the QA program; you should only conclude that the proportion is less than an arbitrary number.
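
A small simulation sketch makes this concrete (the common true proportion of 0.4 is an arbitrary assumption): even when nothing changes, testing the after sample against the _estimated_ before proportion rejects more often than the nominal 5% level, as @whuber's comment warns.

    # Both samples share the SAME true flaw rate (no improvement), yet we test
    # the "after" sample against the ESTIMATED "before" proportion, as in the
    # question. An honest 5%-level test would reject about 5% of the time.
    set.seed(1)
    p_true <- 0.4                    # assumed common flaw rate
    reject <- replicate(10000, {
        x1 <- rbinom(1, 52, p_true)  # "before" flaws
        x2 <- rbinom(1, 55, p_true)  # "after" flaws
        binom.test(c(x2, 55 - x2), p = x1 / 52, alternative = "less")$p.value < 0.05
    })
    mean(reject)                     # typically well above 0.05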

gung - Reinstate Monica
  • Sorry to bother you again after so long. It's just that I am dealing with the practical applications of the discussion... Can I even suggest that the QA program _might_ have had a positive effect? – Antoni Parellada Aug 14 '15 at 22:37
  • I don't know if you saw my comment. It disappeared, or was never posted. I'm sorry to resuscitate this question, but you could help me a lot if you wouldn't mind commenting on whether we can at least suggest that the QA program _might_ have led to improvement, given a significant p value. – Antoni Parellada Aug 14 '15 at 22:47
  • @AntoniParellada: There's no sign of any deleted comment. If the proportion under the null were estimated precisely, with say 310 defective out of 520 parts, this would approximate to knowing it "a priori & without error" (and these figures evidence improvement at a glance in any case); but in general the procedure you've described will give too small a p-value. There are exact tests for comparing proportions when the sample sizes are small. – Scortchi - Reinstate Monica Aug 15 '15 at 16:02
  • Thanks for responding! Re: deleted comments, I just had to refresh my browser. You know, I did get the part about the variability of the first proportion, and how you can't use it as a fixed reference. My question concerned what @gung referred to in his comment, "you should only conclude that the proportion is less than an arbitrary number", but I think he meant _in the event that the initial proportion is used as a fixed quantity_. Thanks, again! – Antoni Parellada Aug 15 '15 at 17:54
  • @Scortchi sorry about the scrambled grammar... It doesn't allow any further edits... – Antoni Parellada Aug 15 '15 at 18:00
  • @AntoniParellada: You're welcome. I think he did mean that - you may conclude that the proportion is less than 31/52 without attaching any particular meaning to that number. (I've fixed the typo, I think.) – Scortchi - Reinstate Monica Aug 15 '15 at 18:56
  • @AntoniParellada, I'm not sure what to say here. You can always say that the program "might have had a positive effect" no matter what the p-value. That's because you cannot prove the null (see [here](http://stats.stackexchange.com/a/85914/7290)). In the event that the binomial test suggests the after % is < 31/52, you can say that it is < 31/52, but you can't necessarily say that it is < the before %, because you haven't taken into account the uncertainty in the before %. – gung - Reinstate Monica Aug 16 '15 at 15:59
  • @gung this is blowing my mind, and I'll re-read your link until I _possibly_ get it. But are you hinting that there is no point in trying to compare the two proportions statistically (getting a _p_ value)? At the moment I find sentences such as "There being no pointing out the world, we can't say there's no pointing out things" clearer. Incidentally, I found that on the follow-up review the percentage is very similar, and I'm holding off on releasing the findings lest there be consequences... – Antoni Parellada Aug 16 '15 at 16:31
  • @AntoniParellada, where are you quoting "There being no pointing out ..." from? I don't see that anywhere, & I don't know what it means. It is fine to compare 2 proportions. If you compare 1 empirical proportion to a value fixed a priori (& get a sig result), you could say that your data suggest the proportion is < the threshold. If you want to say that the after % is < the before %, you need to run a chi-squared test that takes the uncertainty of both %'s into account. – gung - Reinstate Monica Aug 16 '15 at 16:38
  • @gung [Discourse on the Pointing out of Things](http://plato.stanford.edu/entries/school-names/pointing.html) – Antoni Parellada Aug 16 '15 at 16:46
  • @AntoniParellada, touché. – gung - Reinstate Monica Aug 16 '15 at 17:03

I think what the OP is observing is that, in this age of fast computing, exact binomial probabilities can be calculated by the Clopper-Pearson method even for very large samples, whereas in the past, when the sample size got large, it was easier to use a normal approximation, which would be fairly accurate with or without a continuity correction; a chi-squared approximation is another route. It is my experience that even for very large sample sizes (1,000 or more) the binomial test can be computed accurately and relatively fast.

The only drawback to using exact methods for discrete distributions is that, for any given sample size $n$, certain significance levels cannot be attained exactly. So if you do sample size calculations, searching for the minimum sample size that achieves a given power, you might be surprised to see that going from $n$ to $n+1$ can result in a decrease in power. I have referred to this problem as the saw-toothed power function.
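
Here is a minimal sketch of that saw-tooth (the null value of 0.5, alternative of 0.3, and $\alpha = 0.05$ are arbitrary illustrative choices):

    # Power of the exact one-sided binomial test of H0: p = 0.5 vs. H1: p < 0.5,
    # evaluated at true p = 0.3, as a function of n. The curve is not monotone:
    # power can drop when n increases by 1.
    p0 <- 0.5; p1 <- 0.3; alpha <- 0.05
    ns <- 10:60
    power <- sapply(ns, function(n) {
        crit <- qbinom(alpha, n, p0) - 1  # reject when X <= crit; size <= alpha
        pbinom(crit, n, p1)               # rejection probability under H1
    })
    plot(ns, power, type = "b", xlab = "sample size n", ylab = "power")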

You can see examples of this in my paper with Christine Liu, "The Saw-Toothed Behavior of Power Versus Sample Size and Software Solutions: Single Binomial Proportion Using Exact Methods," The American Statistician, May 2002. You can find it quickly by googling "saw-toothed power function". The same issue applies to confidence intervals.

An earlier paper by Agresti and Coull, published in The American Statistician in 1998, gives a popular method for constructing binomial confidence intervals. This and other methods for obtaining binomial confidence intervals can be found in the Wikipedia article titled "Binomial proportion confidence interval".
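
As a quick sketch, the Agresti-Coull interval for the question's post-intervention sample (11 flawed out of 55) takes only a few lines:

    # 95% Agresti-Coull interval: add z^2/2 pseudo-successes and pseudo-failures,
    # then apply the usual Wald formula to the adjusted counts.
    x <- 11; n <- 55
    z <- qnorm(0.975)
    n_adj <- n + z^2                  # adjusted sample size
    p_adj <- (x + z^2 / 2) / n_adj    # adjusted proportion
    p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / n_adj)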

Michael R. Chernick