I think I have a misconception about chi-square-based tests such as Pearson's test or McNemar's test. The way these tests are defined, the test statistic produced by the formula is not stable under constant proportions of observations to outcomes but grows with increasing sample size.
I don't see how that makes sense, and I would like to know whether I can correct the statistic for the sample size to get a corrected p-value.
Specifically: say I have 6 categories with observed and expected frequencies as follows (the number at index $i$ represents the frequency of class $i$):
expected = [19 18 11 12 26 14]
observed = [18 18 13 15 18 18]
With Pearson's test I get
statistic=5.407692307692308, pvalue=0.3681739442518473
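Roughly, I'm computing this with something like `scipy.stats.chisquare` (a minimal sketch; which array plays the role of `f_obs` changes the exact statistic, but not the growth behavior I describe below):

```python
import numpy as np
from scipy.stats import chisquare

expected = np.array([19, 18, 11, 12, 26, 14])
observed = np.array([18, 18, 13, 15, 18, 18])

# Pearson goodness-of-fit test over 6 categories, df = 6 - 1 = 5
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```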
Now I take a bigger sample, where the proportions are the same (every frequency multiplied by 4):
expected = [ 76 72 44 48 104 56]
observed = [72 72 52 60 72 72]
statistic=21.630769230769232, pvalue=0.0006153325274858169
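This is no coincidence: if every count is multiplied by a constant $k$, each term of the statistic is multiplied by $k$ as well,

$$\sum_i \frac{(k\,O_i - k\,E_i)^2}{k\,E_i} = k \sum_i \frac{(O_i - E_i)^2}{E_i},$$

while the degrees of freedom stay at $5$, so the statistic here is exactly $4 \times 5.4077 \approx 21.6308$ and the p-value shrinks.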
McNemar's test has the same problem:
factor = 1
[[38 82]
[92 14]]
pvalue 0.4951688582884312
statistic 82.0
factor = 2
[[ 76 164]
[184 28]]
pvalue 0.3084332194099148
statistic 164.0
factor = 3
[[114 246]
[276 42]]
pvalue 0.20429165958493722
statistic 246.0
factor = 4
[[152 328]
[368 56]]
pvalue 0.13927374787349467
statistic 328.0
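For completeness, the loop producing these numbers looks roughly like this (a sketch assuming statsmodels' `mcnemar` with its default exact binomial test, which matches the statistics above, i.e. the smaller of the two discordant cells):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

base = np.array([[38, 82],
                 [92, 14]])

for factor in range(1, 5):
    table = factor * base
    # exact=True (the default) runs a binomial test on the two
    # discordant cells; the reported statistic is min(b, c)
    result = mcnemar(table, exact=True)
    print(f"factor = {factor}: statistic={result.statistic}, pvalue={result.pvalue}")
```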
Intuitively this does not make sense to me, given what the test is supposed to tell me: if the proportions stay the same, why should the distributions be judged differently with respect to "how well they fit together" (which is basically what these tests tell me, right?)?
So, if I increase my sample size while the relative frequencies of the classes stay approximately the same, I would like to get a more accurate p-value... not an ever-shrinking one.
I assume I have to apply some normalization somewhere. Is that the case, and if so, how do I do it?