
I seem to be struggling with lack of basic understanding of some important concepts.

This is a question about @Glen_b's answer in this post: Warning in R - Chi-squared approximation may be incorrect

Glen_b writes: "..the chi-square approximation to the distribution of the test statistic relies on the counts being roughly normally distributed"

I do not understand how this can be true if one has, for example, a 2x2 contingency table. How can the counts be (approximately) normally distributed? How can counts of white, black, Hispanic and Asian people possibly be normally distributed? Does he just mean that the counts are roughly equal?

kjetil b halvorsen
Erosennin

1 Answer


The chi-squared contingency table test works because each $\dfrac{O_i-E_i}{\sqrt{E_i}}$ is approximately distributed as a standard normal random variable, and so the sum of the squares of these quantities has approximately a chi-squared distribution. Constraints such as $\sum E_i = \sum O_i$ (and possibly other constraints, for example in a two-way contingency table) reduce the number of degrees of freedom.
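To make the statistic concrete, here is a minimal Python sketch that computes the expected counts and the sum of $\dfrac{(O_i-E_i)^2}{E_i}$ for a two-way table. The function name `pearson_chi2` and the counts in `table` are made up for illustration; the degrees of freedom use the usual $(r-1)(c-1)$ rule for a two-way table with fixed margins.

```python
def pearson_chi2(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of squared standardized residuals ((O - E) / sqrt(E))^2
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # The margin constraints leave (r - 1)(c - 1) degrees of freedom
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


table = [[30, 20], [20, 30]]  # hypothetical 2x2 counts
stat, dof, expected = pearson_chi2(table)
print(stat, dof)  # 4.0 1
```

In this made-up table every expected count is 25, so each of the four cells contributes $(5)^2/25 = 1$ to the statistic, and the margin constraints leave only one degree of freedom.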

This is only an approximation: each $O_i$ behaves much like a binomial random variable with parameters $n=\sum E_i$ and $p=E_i / \sum E_i$, and so can only take discrete values, whereas an actual normal random variable has continuous support. Yates's correction for continuity is sometimes used to mitigate the bias in the chi-squared test that this difference might cause.
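As a rough illustration of the continuity correction, the sketch below computes the statistic for a 2x2 table with and without shrinking each $|O_i - E_i|$ by 0.5. The function name `chi2_2x2` and the counts are hypothetical, and clamping the corrected difference at zero is one convention among several, assumed here for simplicity.

```python
def chi2_2x2(table, yates=False):
    """Pearson statistic for a 2x2 table, optionally with Yates's correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            diff = abs(observed - expected)
            if yates:
                # Shrink the discrepancy by 0.5, never below zero (assumed convention)
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat


table = [[30, 20], [20, 30]]  # hypothetical counts
plain = chi2_2x2(table)
corrected = chi2_2x2(table, yates=True)
print(plain, corrected)
```

For these counts the uncorrected statistic is 4 and the corrected one is about 3.24, so the correction makes the test somewhat more conservative.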

So Glen_b's statement is essentially about the distributions of the individual cell counts given the constraints. It amounts to little more than saying that the normal distribution can approximate the binomial distribution, at least if the expected values of the individual cells are not too small. This is essentially the de Moivre–Laplace theorem.
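One way to see why small expected counts are a problem is to compare the exact binomial CDF of a single cell count with its normal approximation. The sketch below is only illustrative: the helper names, the choice of $n = 100$, the expected counts 1, 5 and 25, and the half-unit continuity shift are all assumptions made for the example.

```python
import math


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma^2) random variable."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def max_cdf_gap(n, p):
    """Largest gap between the binomial CDF and its normal approximation."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    # Evaluate the normal CDF at k + 0.5 (continuity shift) for each integer k
    return max(
        abs(binom_cdf(k, n, p) - normal_cdf(k + 0.5, mu, sigma))
        for k in range(n + 1)
    )


n = 100
for expected_count in (1, 5, 25):
    gap = max_cdf_gap(n, expected_count / n)
    print(f"expected count {expected_count:>2}: max CDF gap {gap:.4f}")
```

The gap shrinks as the expected count grows, which is the practical content of the "not too small" condition.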

Henry