The Pearson test is popular because it's simple to compute - it's amenable to hand calculation even without a calculator (or, historically, even without log tables) - and yet it generally has good power compared to alternatives; that simplicity means it continues to be taught in the most basic subjects. It might be argued that there's an element of technological inertia in the choice, but actually I think the Pearson chi-squared is still an easily defensible choice in a wide range of situations.
Since the G-test is derived from a likelihood ratio test, the Neyman-Pearson lemma would suggest that it should tend to have more power in large samples, but in practice the Pearson chi-squared test has similar power in large samples. (Asymptotically the two should be equivalent in the Pitman sense - there's some brief discussion of the various kinds of asymptotics below - but here I just mean what you tend to see in large samples with a small effect size at typical significance levels, without worrying about a particular sequence of tests by which $n\to\infty$.)
On the other hand, in small samples, the set of available significance levels has more impact than asymptotic power; I don't think there's usually a big difference, but in some situations one or the other may have an advantage*.
* But in that case the neat trick of combining the two may do even better - that is, using one statistic to break ties in the other (non-equivalent) test when samples are small. This increases the set of available significance levels, and so improves power by letting the type I error rate come closer to the desired significance level without resorting to something as unappetizing as randomized tests. (In tests of independence for tables larger than 2x2, the same idea also works with the rxc version of the Fisher exact test.)
Both the Pearson and G-test may be placed within the family of (Cressie-Read) power-divergence statistics (Cressie and Read, 1984 [1]), by setting $\lambda=1$ and $\lambda=0$ respectively; this family includes several other previously defined statistics, such as the Neyman statistic ($\lambda=-2$) and the Freeman-Tukey statistic ($\lambda=-\frac12$), among others. In that context - considering several criteria - Cressie and Read suggested that the statistic with $\lambda=\frac23$ is a good compromise choice.
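Writing $O_i$ for the observed counts and $E_i$ for the expected counts under the null, one way to write the family (with the $\lambda=0$ and $\lambda=-1$ cases defined by taking limits) is

$$\frac{2}{\lambda(\lambda+1)}\sum_i O_i\left[\left(\frac{O_i}{E_i}\right)^{\lambda}-1\right],$$

so that $\lambda=1$ reduces to the Pearson statistic $\sum_i (O_i-E_i)^2/E_i$, while the limit as $\lambda\to 0$ gives the G-statistic $2\sum_i O_i\log(O_i/E_i)$.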
The efficiency issue is worth a brief mention; each definition compares the ratio of sample sizes under two tests. Loosely, Pitman efficiency considers a sequence of tests with fixed level $\alpha$ where the sample sizes achieve the same power over a sequence of ever-smaller effect sizes, while Bahadur efficiency holds the effect size fixed and considers a sequence of decreasing significance levels. (Hodges-Lehmann efficiency holds the significance level and effect size constant and lets the type II error rate decrease toward 0.)
Outside of some statisticians, it doesn't seem common for users of statistics to consider using different significance levels at different sample sizes; in that sense, the sort of behavior we would tend to see if a sequence of increasing sample sizes were available is to hold the significance level constant (for all that other choices might be wiser; they can be difficult to calculate). In any case, Pitman efficiency is the one most often used.
On this topic, P. Groeneboom and J. Oosterhoff (1981) [2] mention (in their abstract):
the asymptotic efficiency in the sense of Bahadur often turns out to be quite an unsatisfactory measure of the relative performance of two tests when the sample sizes are moderate or small.
On the removed paragraph from Wikipedia: it's complete nonsense and it was rightly removed. Likelihood ratio tests were not invented until decades after Pearson's paper on the chi-squared test, so the awkwardness of computing the likelihood ratio statistic in a pre-calculator era was in no sense a consideration for Pearson - the concept of a likelihood ratio test simply didn't exist. Pearson's actual considerations are reasonably clear from his original paper. As I see it, he takes the form of the statistic directly from the term (aside from the $-\frac12$) in the exponent of the multivariate normal approximation to the multinomial distribution.
If I were writing the same thing now, I'd characterize it as the (squared) Mahalanobis distance of the observed counts from the values expected under the null.
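As a quick numerical illustration of that characterization - a minimal sketch in R, assuming a simple multinomial goodness-of-fit setup with made-up counts, and using `MASS::ginv` for a generalized inverse since the multinomial covariance matrix is singular:

```r
library(MASS)   # for ginv(), a Moore-Penrose generalized inverse

obs <- c(43, 52, 25)                  # hypothetical observed counts
p0  <- c(0.4, 0.4, 0.2)               # null probabilities
n   <- sum(obs)
E   <- n * p0                         # expected counts under the null

Sigma <- n * (diag(p0) - p0 %o% p0)   # multinomial covariance under the null (singular)
mahal_sq <- drop(t(obs - E) %*% ginv(Sigma) %*% (obs - E))
pearson  <- sum((obs - E)^2 / E)      # the usual Pearson chi-squared statistic

c(mahalanobis = mahal_sq, pearson = pearson)  # agree up to numerical error
```

The quadratic form doesn't depend on which generalized inverse is used here, because the observed-minus-expected vector lies in the column space of the covariance matrix.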
it makes you wonder why there isn't an R function for the G-test.
It can be found in one or two packages, but it's so simple to calculate that I never bother to load them. Instead I usually compute it directly from the data and the expected values returned by the function that calculates the Pearson chi-squared statistic (or occasionally - at least in some situations - from the output of the `glm` function).
Just a couple of lines in addition to the usual `chisq.test` call are sufficient; it's easier to write it from scratch each time than to load a package to do it. Indeed, you can also do an "exact" test based on the G-test statistic (conditioning on both margins), using the same method that `chisq.test` does: use `r2dtable` to generate as many random tables as you like (I tend to use many more tables than the default used by `chisq.test` in R, unless the original table is so large that it would take a very long time).
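For concreteness, here's a minimal sketch of both - the small 2x3 table `tab` is made up purely for illustration:

```r
tab <- matrix(c(12, 5, 7,
                 9, 8, 4), nrow = 2, byrow = TRUE)   # a made-up 2x3 table

res <- chisq.test(tab)          # Pearson test; we reuse its expected counts
O   <- res$observed
E   <- res$expected

# G-statistic and its asymptotic chi-squared p-value (0*log(0) taken as 0)
G  <- 2 * sum(ifelse(O > 0, O * log(O / E), 0))
df <- res$parameter
pchisq(G, df, lower.tail = FALSE)

# Conditional ("exact") Monte Carlo p-value for G, generating random tables
# with the same margins via r2dtable -- the same approach chisq.test uses for
# simulate.p.value, but applied to G. Since the margins are fixed, the expected
# counts E are the same for every simulated table.
B    <- 1e5                     # far more tables than chisq.test's default of 2000
sims <- r2dtable(B, rowSums(tab), colSums(tab))
Gsim <- sapply(sims, function(tb) 2 * sum(ifelse(tb > 0, tb * log(tb / E), 0)))
(sum(Gsim >= G) + 1) / (B + 1)  # simulated p-value
```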
References
[1]: Cressie, N. and Read, T.R. (1984),
"Multinomial Goodness-of-Fit Tests."
Journal of the Royal Statistical Society: Series B (Methodological), 46, pp. 440-464.
[2]: Groeneboom, P. and Oosterhoff, J. (1981),
"Bahadur Efficiency and Small-Sample Efficiency."
International Statistical Review, 49, pp. 127-141.