Pearson's chi-squared test for goodness of fit is a particular case of Rao's score test: given counts of independent observations from a multinomial distribution, it tests the null hypothesis that the parameters (the probabilities $\pi=(\pi_1, \ldots, \pi_{m-1})$, for $m$ categories) are constrained to have a particular relationship (in the extreme, the distribution is fully specified) against the alternative that the parameters are unconstrained (a saturated model). The test statistic is derived as $U^\mathrm{T}(\tilde\pi)\, \mathcal{I}^{-1}(\tilde\pi)\, U(\tilde\pi)$, where $U$ is the score function & $\mathcal{I}$ the Fisher information, both evaluated at the restricted maximum-likelihood estimate under the null, $\tilde\pi$; & is asymptotically distributed as chi-squared with the no. of degrees of freedom equal to the no. of independent constraints.
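To see the reduction in the simplest case, a fully specified null $\pi=\pi^0$ with no parameters to estimate, here is a sketch of the standard algebra (writing $\pi_m = 1-\sum_{k<m}\pi_k$, $N_m = n-\sum_{k<m}N_k$, & $\delta_{jk}$ for the Kronecker delta):
$$\begin{align}
U_j(\pi) &= \frac{N_j}{\pi_j} - \frac{N_m}{\pi_m}, \qquad j=1, \ldots, m-1 \\
\mathcal{I}_{jk}(\pi) &= n\left(\frac{\delta_{jk}}{\pi_j} + \frac{1}{\pi_m}\right), \qquad \left(\mathcal{I}^{-1}(\pi)\right)_{jk} = \frac{\pi_j\delta_{jk} - \pi_j\pi_k}{n} \\
U^\mathrm{T}(\pi^0)\, \mathcal{I}^{-1}(\pi^0)\, U(\pi^0) &= \sum_{j=1}^{m}\frac{(N_j - n\pi^0_j)^2}{n\pi^0_j}
\end{align}
$$
The last line is Pearson's familiar statistic, here asymptotically $\chi^2$ with $m-1$ degrees of freedom.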
The normality test relies on binning $n$ i.i.d. observations (according to cut-points $c_2, \ldots, c_m$, setting $c_1=-\infty$ & $c_{m+1}=\infty$) & performing the score test on the counts in each bin, $N_j$. When the null is simple—a normal distribution with known mean $\mu$ & standard deviation $\sigma$—clearly it constrains each of the $m-1$ parameters to a particular value. When the null is composite—$\mu$ & $\sigma$ are unknown—it may be helpful to consider a reparametrization of the model in which $\mu$ & $\sigma$ serve to specify two particular $\pi_j$ (it can be confirmed that the relation is one-to-one):
$$\begin{align}
\psi &= (\mu, \sigma) \\
\theta &= (\theta_1, \ldots, \theta_{m-3})
\end{align}
$$
where
$$
\pi_j = \begin{cases}
\Phi(c_{j+1}; \psi) - \Phi(c_j; \psi) + \theta_j & \text{for } j=1, \ldots, m-3 \\
\Phi(c_{j+1}; \psi) - \Phi(c_j; \psi) & \text{for } j=m-2, m-1
\end{cases}
$$
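As a concrete check that this map does define bin probabilities, here is my own illustrative R sketch (the function name pi.from.psi.theta & its interface are hypothetical); note that the $m$-th, implicit, probability must absorb $-\sum_j \theta_j$ so that all $m$ probabilities sum to one:

pi.from.psi.theta <- function(psi, theta, cutpoints){
  # cutpoints = c(-Inf, c_2, ..., c_m, Inf): m bins; theta has length m-3
  p <- diff(pnorm(cutpoints, psi["mu"], psi["sigma"])) # Phi(c_{j+1}) - Phi(c_j)
  p[seq_along(theta)] <- p[seq_along(theta)] + theta # perturb bins 1, ..., m-3
  p[length(p)] <- p[length(p)] - sum(theta) # bin m absorbs the perturbations
  p
}
# e.g. with m = 6 bins & theta of length 3; theta = 0 recovers the null probabilities
# pi.from.psi.theta(c(mu=0, sigma=1), rep(0, 3), c(-Inf, -1, -0.5, 0, 0.5, 1, Inf))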
Under the alternative hypothesis the $\theta_j$, $j=1, \ldots, m-3$, are unconstrained; the full likelihood is
$$
\prod_{j=1}^{m-3}\left[\Phi(c_{j+1}; \psi)-\Phi(c_j;\psi)+ \theta_j\right]^{N_j} \cdot \prod_{j=m-2}^{m-1}\left[\Phi(c_{j+1};\psi)-\Phi(c_j;\psi)\right]^{N_j} \cdot \left[\Phi(c_{m+1};\psi)-\Phi(c_m;\psi)-\sum_{j=1}^{m-3}\theta_j\right]^{N_m}
$$
(the last factor, for the implicit $m$-th category, absorbs the $\theta_j$ so that the probabilities sum to one). The null hypothesis is $\theta_j=0$ for $j=1, \ldots, m-3$, constraining $m-1-2$ parameters; the constrained likelihood is
$$
\prod_{j=1}^{m}\left[\Phi(c_{j+1};\psi)-\Phi(c_j;\psi)\right]^{N_j}
$$
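Carrying the score-test recipe through at the constrained estimate $\tilde\psi$ (& $\theta=0$) again yields, by the same algebra as in the simple case (a sketch of the classical result), a statistic of Pearson's form:
$$
Q = \sum_{j=1}^{m}\frac{\left(N_j - n\tilde\pi_j\right)^2}{n\tilde\pi_j}, \qquad \tilde\pi_j = \Phi(c_{j+1};\tilde\psi)-\Phi(c_j;\tilde\psi),
$$
asymptotically $\chi^2$ with $m-1-2 = m-3$ degrees of freedom; this is the statistic computed as Q_2 in the code below.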
Two points need emphasis:
1. The cut-points for binning are fixed & pre-specified, & the counts are random variables. Recognizing this precludes entirely, for example, defining the cut-points as quantiles of the observed sample, which would make the former random & the latter fixed;† & it means that even minor adjustments to the cut-points in the light of the data necessitate taking the results of the test with a pinch of salt.
2. The constrained maximum-likelihood estimator of $\psi$ is $$\tilde\psi=\operatorname*{arg\,sup}_\psi\prod_{j=1}^{m}\left[\Phi(c_{j+1};\psi)-\Phi(c_j;\psi)\right]^{N_j}$$ (this is what the optim call in the code below computes). Other estimators with the same asymptotic efficiency can, I believe, be used in its stead; but estimators based on sufficient statistics calculated from the unbinned observations are demonstrably inappropriate, & an estimator's being based on the binned observations is a necessary, but by no means sufficient, condition.
A quick illustration of what can go wrong follows, borrowing @BruceET's example, & much of his code:
set.seed(2020)
#set up problem
n <- 200 #sample size
psi <- c(mu=100, sigma=15) # true, unknown, parameter values
psi.guess <- c(mu=75, sigma=10) # guess at parameter values - for binning
k <- 10 # no. bins
cutpoints.fixed <- qnorm( # fixed cutpoints (the extreme ones, qnorm(0) & qnorm(1), are -Inf & +Inf)
  seq(0, 1, len=k+1),
  psi.guess["mu"], psi.guess["sigma"]
)
# prepare simulation
no.sims <- 10e3 # no. simulations to perform
Q_1 <- numeric(no.sims) # wrong test statistic (cutpoints derived from sample, non-M.L. estimates)
Q_2 <- numeric(no.sims) # right test statistic (fixed cutpoints, M.L. estimates)
log.likelihood <- function(psi, observed, cutpoints){ # log-likelihood of binned counts under null
  sum(observed * log(diff(pnorm(cutpoints, psi[1], psi[2]))))
}
# simulate
for(i in 1:no.sims){
  x <- rnorm(n, psi["mu"], psi["sigma"])
  # wrong way
  cutpoints <- quantile(x, seq(0,1, len=k+1)) # random cutpoints: sample quantiles
  observed <- hist(x, breaks=cutpoints, plot=FALSE)$counts
  midpoints <- (cutpoints[1:k] + cutpoints[2:(k+1)])/2
  mu.mpe <- sum(observed * midpoints)/n # midpoint estimate of mu
  sigma.mpe <- sqrt(sum(observed*(midpoints - mu.mpe)^2)/(n-1)) # midpoint estimate of sigma
  expected <- n * diff(pnorm(cutpoints, mu.mpe, sigma.mpe))
  Q_1[i] <- sum((observed - expected)^2/expected)
  # right way
  observed <- as.vector(table(cut(x, breaks=cutpoints.fixed))) # cut() accepts the infinite end cutpoints, unlike hist()
  psi.mle <- optim( # find parameter values that maximize log-likelihood under null
    par = c(mean(x), sd(x)), # use estimates from raw sample as initial values
    fn = log.likelihood, control=list(fnscale = -1),
    observed=observed, cutpoints=cutpoints.fixed
  )$par
  names(psi.mle) <- c("mu", "sigma")
  expected <- n * diff(pnorm(cutpoints.fixed, psi.mle["mu"], psi.mle["sigma"]))
  Q_2[i] <- sum((observed - expected)^2/expected)
}
p_value.Q_1 <- 1 - pchisq(Q_1, k-1-2) # refer both statistics to chi-squared with k-3 d.f.
p_value.Q_2 <- 1 - pchisq(Q_2, k-1-2)
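As a quick numerical check (my addition, rather than part of BruceET's code), the empirical rejection rates at the nominal 5% level can be compared against 0.05:

# empirical Type I error rates at the nominal 5% level; a valid test should give about 0.05
mean(p_value.Q_1 < 0.05) # expected to be inflated
mean(p_value.Q_2 < 0.05) # expected to be close to 0.05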
The distribution of $Q_1$, calculated with cut-points defined by the quantiles of the observations, & with estimates of $\mu$ & $\sigma$ made by taking each observation to be at the mid-point of its bin, is far from $\chi^2(7)$, while the distribution of $Q_2$, calculated correctly, is very close:

Consequently the p-values obtained from $Q_1$ are stochastically smaller than uniform, & the Type I error rate is inflated:

† Not a bad idea in itself—goodness-of-fit tests based on the empirical distribution function (e.g. Kolmogorov–Smirnov, Anderson–Darling) take this approach, dispensing with binning altogether. See "Impact of data-based bin boundaries on a chi-square goodness of fit test?" for discussion of, & references for, the distribution of Pearson's test statistic when cut-points are random.