The chi-squared distribution with $k$ degrees of freedom is defined as the distribution of the sum of the squares of $k$ independent standard normal random variables:

$$\chi^2 = \sum_{i=1}^k Z_i^2$$

where each $Z_i\sim \mathcal{N}(0, 1)$.

Now consider independent normal random variables $X_1, X_2, \cdots, X_k$ with means $\mu_1, \mu_2, \cdots, \mu_k$ and variances $\sigma_1^2, \sigma_2^2, \cdots, \sigma_k^2$. In this case, each $X_i$ can be converted to a standard normal random variable:

$$\frac{X_i-\mu_i}{\sigma_i}$$

Thus, $\chi^2$ can also be written in terms of these variables:

$$\chi^2 = \sum_{i=1}^k \frac{(X_i-\mu_i)^2}{\sigma_i^2}$$

However, when we use the *chi-squared test*, we construct the following statistic:

$$\sum_{i=1}^n \frac{(f_{io}-f_{ie})^2}{f_{ie}}$$

where $n$ is the total number of categories, and $f_{io}$ and $f_{ie}$ are the observed and expected frequencies for category $i$. I am trying to understand the relation between the original definition of $\chi^2$ and the test statistic written above. I verified computationally that the test statistic indeed follows a $\chi^2$ distribution with $n-1$ degrees of freedom, but I am unable to make much progress after that. One possibility I thought of is that the observed frequencies are Poisson distributed with mean $f_{ie}$, so that the corresponding standard deviation is $\sqrt{f_{ie}}$. If I then pretend that each category frequency $f_{io}$ has a normal (instead of Poisson) distribution, then, just as with the normal random variables, I can convert these to the "standard" form as:

$$\frac{f_{io}-f_{ie}}{\sqrt{f_{ie}}}$$

and squaring and adding such terms would give me the test statistic. However, this has to be wrong because the original definition contains as many standard normals as the degrees of freedom but the statistic contains one more. Can somebody kindly clarify this relationship?
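
For reference, the computational check mentioned above can be sketched roughly as follows. This is only a minimal sketch under assumptions: four categories with probabilities $[0.1, 0.25, 0.3, 0.35]$, and the hypothetical choices of `n_obs = 1000` observations per experiment and `n_rep = 5000` repeated experiments. It uses a Kolmogorov-Smirnov distance instead of a histogram overlay to compare the simulated statistic against $\chi^2$ with $3$ and with $4$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Assumed setup (illustrative values, not a definitive procedure).
rng = np.random.default_rng(0)
probs = np.array([0.1, 0.25, 0.3, 0.35])  # category probabilities
n_obs = 1000   # observations per simulated experiment (assumption)
n_rep = 5000   # number of simulated experiments (assumption)

# Observed frequencies f_io: one multinomial draw per experiment.
counts = rng.multinomial(n_obs, probs, size=n_rep)  # shape (n_rep, 4)

# Expected frequencies f_ie under the hypothesised probabilities.
expected = n_obs * probs

# Test statistic: sum over categories of (f_io - f_ie)^2 / f_ie.
stat = ((counts - expected) ** 2 / expected).sum(axis=1)

# Compare the simulated statistic with chi-squared reference
# distributions having n - 1 = 3 and n = 4 degrees of freedom.
ks3 = stats.kstest(stat, "chi2", args=(3,))
ks4 = stats.kstest(stat, "chi2", args=(4,))

print("mean of statistic:", stat.mean())
print("KS distance to chi2(3):", ks3.statistic)
print("KS distance to chi2(4):", ks4.statistic)
```

A smaller KS distance means a closer match, so the distance for $3$ degrees of freedom should come out clearly smaller than the one for $4$.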

Peaceful
  • I can't confirm but that sounds correct to me because, under the null that the categories are independent ( rows independent of columns ), then $f_{io} = f_{ie}$ and the mean of the numerator would be zero which would lead to a N(0,1) ( using the CLT if $n$ is large enough ) . But hopefully someone else can confirm. – mlofton Apr 19 '21 at 06:52
  • What do you mean by "I verified computationally that the test statistic indeed has the correct distribution with $n-1$ degrees of freedom"? I would have thought that if your fluctuations are gaussian, and there is no fit reducing the degrees of freedom, then the test statistic would have followed a $\chi^2$ with $n$ degrees of freedom. Also, I think in practice, Poisson is close to gaussian even for relatively small (8) expectations. – Mister Mak Apr 19 '21 at 09:00
  • @MisterMak: I took 4 classes with probabilities $[0.1, 0.25, 0.3, 0.35]$ and then generated $1000$ samples from them, calculated $\chi^2$ statistic for each of them and plotted their histogram. On top of that, I plotted the actual $\chi^2$ with $3$ degrees of freedom, and I get good match. If instead I use $4$ degrees of freedom, I get a terrible match. – Peaceful Apr 19 '21 at 10:06
  • My account of the $\chi^2$ test at https://stats.stackexchange.com/a/17148/919 addresses most of your questions. – whuber Apr 19 '21 at 12:51
  • @whuber : At least for me, it doesn't. I already read your answer twice but I seem to lack most of the mathematical weapons to relate it to my question. I found a derivation [here](http://personal.psu.edu/drh20/asymp/lectures/asymp.pdf) in a link you posted somewhere else, but I am also looking for an intuitive explanation. – Peaceful Apr 19 '21 at 12:54
  • @whuber: "You must base that estimate on the counts, not on the actual data!". What does that really mean? Also, how on earth a layman like me can decide the number of degrees of freedom if you think that definitions elsewhere are wrong? Where is your definition? – Peaceful Apr 19 '21 at 12:56
