
I want to rigorously formalize the statistical model (i.e. the set of "allowed distributions") for the chi-squared goodness-of-fit test (in the case where the interval probabilities are fixed and do not depend on unknown parameters, so that $T(\mathbf{X}) \xrightarrow{d} \chi^2_{k-1}$ as $n \to \infty$) and to formalize its hypotheses using this model. Unfortunately, the statistical textbooks I have seen do not do this rigorously. Below is my attempt, followed by two questions about it (throughout, $\mathbf{X} = (X_1, \ldots, X_n)$ is an i.i.d. sample and $\mathbf{x}=(x_1,\ldots,x_n)$ is its realization).

Case I. Input data: positive integers $\nu_1,\ldots,\nu_k$ and probabilities $p_1,\ldots,p_k \in (0,1)$, where $\sum_{j=1}^k \nu_j = n$ and $\sum_{j=1}^k p_j = 1$.
Divide the support $R_X = \mathrm{supp}(F_{\mathrm{true}})$ of a random variable $X \sim F_{\mathrm{true}}$ into $k$ mutually exclusive intervals $A_1,\ldots,A_k$ with fixed but unknown boundaries; the support $R_X$ and $F_{\mathrm{true}}$ are unknown as well. Using $n$ independent observations $x_1, \ldots, x_n$ of the r.v. $X$, someone has computed $\nu_1, \ldots, \nu_k$, the observed frequencies of hitting the intervals $A_1,\ldots,A_k$, respectively.
In this case I define the statistical model as $\mathscr{F} = \{F: \exists A^F_1, \ldots, A^F_k \subset \mathrm{supp}(F): A^F_1 \sqcup \ldots \sqcup A_k^F = \mathrm{supp}(F)\}$ (this family contains all univariate continuous distributions; it also contains all univariate discrete distributions with $|\mathrm{supp}(F)| \ge k$).

And the hypotheses are the following:
$H_0: ~ F_{\mathrm{true}} \in \mathscr{F}_0$, where $\mathscr{F}_0 \subset \mathscr{F}$ such that $\forall F \in \mathscr{F}_0 \hookrightarrow P_{F}(X \in A^F_j) = p_j, \, \forall j =1,\ldots,k$.
$\iff \widetilde H_0: F_{\mathrm{true}} \in \mathscr{F}_0$, where $\mathscr{F}_0 \subset \mathscr{F}$ such that $\forall F \in \mathscr{F}_0 \hookrightarrow \boldsymbol\nu(\mathbf{X}) = (\nu_1(\mathbf{X}), \ldots, \nu_k(\mathbf{X})) \sim \mathrm{Mult}(n,\mathbf{p}),$ where $\mathbf{p} = (p_1, \ldots, p_k),~ \nu_j(\mathbf{X}) = \sum_{i=1}^n I(X_i \in A_j^F);~$ (we often simply write "$H_0: \mathbf{p}_{\mathrm{true}} = \mathbf{p}$")

$H_1: F_{\mathrm{true}} \in \mathscr{F}_1 = \mathscr{F} \setminus \mathscr{F}_0$.      ($\iff H_1: \mathbf{p}_{\mathrm{true}} \neq \mathbf{p}$)

In terms of the random vector $\boldsymbol\nu(\mathbf{X}) = (\nu_1(\mathbf{X}), \ldots, \nu_k(\mathbf{X}))$, the statistical model is the family of all multinomial distributions with $n$ trials and $k$ categories, i.e. $\mathcal{V} = \{\mathrm{Mult}(n, \mathbf{p}): \mathbf{p} \in (0,1)^k,\ \sum_{j=1}^k p_j = 1\}$.
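For a quick numerical check of this reduction, here is a minimal simulation sketch in Python (the values of $n$ and $\mathbf{p}$, the choice of $F = \mathrm{Unif}(0,1)$ as one particular member of $\mathscr{F}_0$, and the cut points are all made up for illustration): binning an i.i.d. Uniform$(0,1)$ sample at the cut points $0.2$ and $0.7$ produces count vectors whose empirical mean and covariance match those of $\mathrm{Mult}(n, \mathbf{p})$, namely $n\mathbf{p}$ and $n(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top)$.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
n, p = 200, np.array([0.2, 0.5, 0.3])     # made-up n and p
cuts = np.cumsum(p)[:-1]                  # cut points 0.2, 0.7 for F = Unif(0,1),
                                          # chosen so that P_F(X in A_j) = p_j

# Replicate the experiment: bin an i.i.d. Unif(0,1) sample of size n many times
X = rng.uniform(size=(10_000, n))
nu = np.stack([np.bincount(np.searchsorted(cuts, row), minlength=len(p)) for row in X])

# Under H0 the counts are Mult(n, p): mean n*p, covariance n*(diag(p) - p p^T)
print(nu.mean(axis=0), n * p)                              # should be close
print(np.cov(nu.T), n * (np.diag(p) - np.outer(p, p)))     # should be close
```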

Finally, the test statistic is $\displaystyle T(\mathbf{X}) = \sum_{j=1}^k \frac{(\nu_j(\mathbf{X}) - n p_j)^2}{np_j}$, and its realization is $\displaystyle T(\mathbf{x}) = \sum_{j=1}^k \frac{(\nu_j - n p_j)^2}{np_j}$.
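For concreteness, here is a minimal sketch of the Case I computation in Python (the numbers $\nu_j$ and $p_j$ below are made up; `scipy.stats.chisquare` returns the same statistic and the same asymptotic $\chi^2_{k-1}$ $p$-value):

```python
import numpy as np
from scipy import stats

nu = np.array([18, 55, 27])      # made-up observed frequencies, n = 100
p = np.array([0.2, 0.5, 0.3])    # fixed null probabilities, summing to 1
n, k = nu.sum(), len(nu)

# T(x) = sum_j (nu_j - n p_j)^2 / (n p_j)
T = np.sum((nu - n * p) ** 2 / (n * p))

# Asymptotic p-value from the chi^2_{k-1} approximation
p_value = stats.chi2.sf(T, df=k - 1)

# The same numbers via scipy:
T_scipy, p_value_scipy = stats.chisquare(f_obs=nu, f_exp=n * p)
```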

Case II. Input data: $\mathbf{x} = (x_1,\ldots, x_n)$, a realization of an i.i.d. sample $\mathbf{X} = (X_1,\ldots, X_n)$ from an unknown distribution $F_{\mathrm{true}}$, and a fully specified (known) hypothesized distribution $F_0$. It is assumed that we know the support $R_X = \mathrm{supp}(F_{\mathrm{true}})$ of a random variable $X \sim F_{\mathrm{true}}$.
Divide the support $R_X = \mathrm{supp}(F_{\mathrm{true}})$ of a random variable $X \sim F_{\mathrm{true}}$ into $k$ mutually exclusive intervals $A_1,\ldots,A_k$ (so that each of them contains at least one observation from $\mathbf{x}$): $R_X = A_1 \sqcup \ldots \sqcup A_k$, and compute the probabilities $p_j = P_{F_0}(X \in A_j), \ \forall j =1,\ldots,k$. If $\sum_{j=1}^k p_j < 1$, or at least one of the $p_j$ equals 0 or 1, then we say that the chi-squared goodness-of-fit test is inappropriate for $F_0$. After that we compute $\nu_1, \ldots, \nu_k$, the observed frequencies with which the elements of $\mathbf{x}$ fall into the intervals $A_1, \ldots, A_k$, respectively. Finally, we formalize the statistical model and hypotheses as in Case I.

Case III. The same as Case II, but the support $\mathrm{supp}(F_{\mathrm{true}})$ is unknown. Here we set $R_X = \mathbb{R}$ and then proceed exactly as in Case II.
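To make Cases II and III concrete, here is a minimal sketch of the whole procedure in Python (everything below is a made-up example: the data are simulated, $F_0$ is taken to be $N(0,1)$, and the cut points $-1, 0, 1$ are chosen arbitrarily, not from the data; since it uses $R_X = \mathbb{R}$ with unbounded end intervals, it is exactly the Case III computation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                   # toy data for illustration only
x = rng.normal(loc=0.1, scale=1.0, size=300)     # realization of the i.i.d. sample
F0 = stats.norm(loc=0.0, scale=1.0)              # fully specified hypothesized F_0

# Fixed cut points -1 < 0 < 1 partition R_X = R into k = 4 intervals
# A_1 = (-inf, -1), A_2 = [-1, 0), A_3 = [0, 1), A_4 = [1, +inf)
cuts = np.array([-1.0, 0.0, 1.0])
k = len(cuts) + 1

# p_j = P_{F_0}(X in A_j); the test is declared inappropriate if some p_j is 0 or 1
p = np.diff(np.concatenate(([0.0], F0.cdf(cuts), [1.0])))

# nu_j = number of observations falling into A_j
nu = np.bincount(np.searchsorted(cuts, x, side="right"), minlength=k)
n = nu.sum()

T = np.sum((nu - n * p) ** 2 / (n * p))
p_value = stats.chi2.sf(T, df=k - 1)
```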

My questions.

  1. Are my formalizations of $\mathscr{F}, H_0, H_1$ (given in Case I) appropriate for Cases I and III?
  2. It seems obvious that in Case II we should replace $\mathscr{F}$ with $\widetilde{\mathscr{F}} = \{F: \mathrm{supp}(F)=R_X \text{ and } \exists A^F_1, \ldots, A^F_k \subset \mathrm{supp}(F): A^F_1 \sqcup \ldots \sqcup A_k^F = \mathrm{supp}(F)\}$. But I have never seen a software package that computes the chi-squared GOF test (or its characteristics, such as power) use information about the support in its calculations. What is the reason for that?
  • I think your mathematical formalism misses some of the most important aspects of this test, especially at "Divide the support $R_X=\mathrm{supp}(F_{\mathrm{true}})$ of a random variable $X\sim F_{\mathrm{true}}$ into $k$ mutually exclusive intervals $A_1,\ldots,A_k$ (so that each of them contains at least one observation from $\mathbf{x}$)." Creating the intervals based on the data is a mistake, as explained at https://stats.stackexchange.com/a/17148/919. Your formalism also abstracts out most of the features of the data, making it less than insightful: it confuses the distribution of the test statistic with the model. – whuber Sep 27 '21 at 21:14
  • @whuber Well, let's discard Cases II and III and consider only Case I (assuming the intervals were built correctly by somebody). Then I don't see what is wrong with my formalism. Here we just can't consider discrete distributions with fewer than $k$ unique values, so I excluded them from the model. – Rodvi Sep 28 '21 at 05:52
