I think I have a misconception about chi-square-based tests such as Pearson's test or McNemar's test. The way these tests are defined, the test statistic produced by the formula is not stable under constant proportions of observations to outcomes but grows with increasing sample size.
I don't see how that makes sense, and I would like to know whether I can correct the statistic for the sample size to get a corrected p-value.
Specifically: say I have 6 categories with observed and expected frequencies as follows (the number at index $i$ represents the frequency of class $i$):
expected = [19 18 11 12 26 14]
observed = [18 18 13 15 18 18]
With Pearson's test I get
statistic=5.407692307692308, pvalue=0.3681739442518473
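Roughly, I'm computing this with something like `scipy.stats.chisquare` (a minimal sketch; which array plays the role of `f_obs` changes the exact statistic, but not the growth behavior I describe below):

```python
import numpy as np
from scipy.stats import chisquare

expected = np.array([19, 18, 11, 12, 26, 14])
observed = np.array([18, 18, 13, 15, 18, 18])

# Pearson goodness-of-fit test over 6 categories, df = 6 - 1 = 5
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```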
Now I take a bigger sample, where the proportions are the same (every frequency multiplied by 4):
expected = [ 76 72 44 48 104 56]
observed = [72 72 52 60 72 72]
statistic=21.630769230769232, pvalue=0.0006153325274858169
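This is no coincidence: if every count is multiplied by a constant $k$, each term of the statistic is multiplied by $k$ as well,

$$\sum_i \frac{(k\,O_i - k\,E_i)^2}{k\,E_i} = k \sum_i \frac{(O_i - E_i)^2}{E_i},$$

while the degrees of freedom stay at $5$, so the statistic here is exactly $4 \times 5.4077 \approx 21.6308$ and the p-value shrinks.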
McNemar's test has the same problem:
factor = 1
[[38 82]
[92 14]]
pvalue 0.4951688582884312
statistic 82.0
factor = 2
[[ 76 164]
[184 28]]
pvalue 0.3084332194099148
statistic 164.0
factor = 3
[[114 246]
[276 42]]
pvalue 0.20429165958493722
statistic 246.0
factor = 4
[[152 328]
[368 56]]
pvalue 0.13927374787349467
statistic 328.0
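For completeness, the loop producing these numbers looks roughly like this (a sketch assuming statsmodels' `mcnemar` with its default exact binomial test, which matches the statistics above, i.e. the smaller of the two discordant cells):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

base = np.array([[38, 82],
                 [92, 14]])

for factor in range(1, 5):
    table = factor * base
    # exact=True (the default) runs a binomial test on the two
    # discordant cells; the reported statistic is min(b, c)
    result = mcnemar(table, exact=True)
    print(f"factor = {factor}: statistic={result.statistic}, pvalue={result.pvalue}")
```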
Intuitively this does not make sense to me, given what the test is supposed to tell me: if the proportions stay the same, why should the distributions be judged differently with respect to "how well they fit together" (which is basically what these tests tell me, right?)?
So, if I increase my sample size while the relative frequencies of the classes stay approximately the same, I would like to get a more accurate p-value... not an ever-shrinking one.
I assume I have to apply some normalization somewhere. Is that the case, and if so, how do I do it?