Formula for computing the Pearson $\chi^2$, comparison with R

Question

I suspect this question is really about a basic definition, but I could not find the resource I need to solve my problem.

I want to understand why the Pearson $\chi^2$ test statistic, and the corresponding residuals, are computed the way they are in R.

First, some tests:

> d <- data.frame(x = 1:10000, y = sample(c(rep(1, 100), 0), 10000, replace = TRUE))
> M <- glm(y ~ x, family = "binomial", data = d)
> d$p <- predict(M, type = "response")
> chisq.test(table(d$p, d$y))

Pearson's Chi-squared test

data:  table(d$p, d$y)
X-squared = 10000, df = 9999, p-value = 0.4953

Warning message:
  In chisq.test(table(d$p, d$y)) :
  Chi-squared approximation may be incorrect

OK, now an alternative that gives a consistent result, computed as in this answer:

> sum(residuals(M, type = "pearson")^2)
[1] 10000.75

Considering the formula

$$\chi^2 = \sum_{i \in \text{observations}} (O_i - P_i)^2/P_i$$

where the $O_i$ are the observed values and $P_i$ are the probabilities given by the model, I would have, perhaps naïvely, calculated

> sum((d$y - d$p)^2 / d$p)
[1] 14

which gives a different result. This is because the residuals are different:

> head(residuals(M, type = "pearson")^2)
          1           2           3           4           5           6 
0.001286902 0.001286924 0.001286946 0.001286968 0.001286989 0.001287011 
> head((d$y - d$p)^2 / d$p)
[1] 1.653989e-06 1.654045e-06 1.654101e-06 1.654157e-06 1.654213e-06 1.654269e-06

Going further, one discovers that the formula for the residuals is actually (see below for code)

$$\chi^2 = \sum_{i \in \text{observations}} (O_i - P_i)^2/(P_i(1-P_i))$$

which is much nicer (in my opinion), as it results in a $\chi^2$ that is constant as a function of $P_i$. But where does that come from?

Thanks!

----- how I found the last formula:

> getAnywhere(residuals.glm)
A single object matching ‘residuals.glm’ was found
It was found in the following places
  package:stats
  registered S3 method for residuals from namespace stats
  namespace:stats
with value

function (object, type = c("deviance", "pearson", "working",
    "response", "partial"), ...)
{
    type <- match.arg(type)
    y <- object$y
    r <- object$residuals
    mu <- object$fitted.values
    wts <- object$prior.weights
    switch(type, deviance = , pearson = , response = if (is.null(y)) {
        mu.eta <- object$family$mu.eta
        eta <- object$linear.predictors
        y <- mu + r * mu.eta(eta)
    })
    res <- switch(type, deviance = if (object$df.residual > 0) {
        d.res <- sqrt(pmax((object$family$dev.resids)(y, mu,
            wts), 0))
        ifelse(y > mu, d.res, -d.res)
    } else rep.int(0, length(mu)), pearson = (y - mu) * sqrt(wts)/sqrt(object$family$variance(mu)),
        working = r, response = y - mu, partial = r)
    if (!is.null(object$na.action))
        res <- naresid(object$na.action, res)
    if (type == "partial")
        res <- res + predict(object, type = "terms")
    res
}
<bytecode: 0x000000000a27cfc0>
<environment: namespace:stats>
> M$family$variance
function (mu)
mu * (1 - mu)
<bytecode: 0x000000000a191810>
<environment: 0x0000000008b31710>

score 0 · Accepted Answer · edited Apr 13 '17 at 12:44

chisq.test(table(d$p,d$y))

This command is not computing the chi-square test you think it is. Notice that the table command creates a contingency table of the two vectors, so chisq.test performs a chi-square test of independence on that table. As an example of what the table command is doing, check out the 2 by 2 table created below:

table( c("Y", "Y", "N", "N"), c(1,1,1,0))
  0 1
N 1 1
Y 0 2
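To see how this plays out for fitted probabilities, here is a minimal sketch on simulated data (the seed, the sample size of 200, and the success probability are arbitrary choices, not taken from the question). It only illustrates that `table()` treats every distinct fitted value as its own category, so the degrees of freedom reported by `chisq.test` come from the shape of that table, not from the model.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 200                          # smaller than the question's 10000, to keep it quick
d <- data.frame(x = 1:n, y = rbinom(n, 1, 0.5))
M <- glm(y ~ x, family = binomial, data = d)
p <- predict(M, type = "response")

# table() cross-classifies the two vectors: one row per distinct fitted
# probability, one column per observed outcome (0 or 1).
tab <- table(p, d$y)
dim(tab)

# Degrees of freedom of a test of independence on an r x c table.
df_independence <- (nrow(tab) - 1) * (ncol(tab) - 1)

# chisq.test reports the same number; the warning about small expected
# counts is suppressed here because it is expected for such a sparse table.
chisq_df <- suppressWarnings(chisq.test(tab))$parameter
c(df_independence = df_independence, chisq_df = unname(chisq_df))
```

With one row per distinct fitted value and two columns, the degrees of freedom are the number of rows minus one, which matches the pattern of df = 9999 for the 10000 observations in the question.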

I don't think there is any function that will give you a chi-squared statistic of the form $\sum (O_i - P_i)^2/P_i$ because this does not correspond to a chi-squared test I am aware of! If you are trying to perform a test of goodness of fit, you might want to check out the Hosmer-Lemeshow goodness of fit test.

In looking through the residuals function, it appears you stumbled upon the formula for Pearson residuals, but the motivation for these residuals is not to perform a Pearson chi-squared test! Notice that since your data are Bernoulli (binomial with $N=1$), $E(Y_i)=p_i$ and $\operatorname{Var}(Y_i)=p_i(1-p_i)$. Thus, the Pearson residuals are equivalent to standardizing $Y_i$: subtract its estimated mean and divide by its estimated standard deviation. See Residuals from glm model with log link function for a discussion of what the Pearson residuals are typically used for. However, in this case of binary logistic regression, it is impossible to misspecify the relationship between the mean and variance of a Bernoulli random variable. Thus the Pearson residuals won't help to diagnose overdispersion, because overdispersion cannot occur in a binary logistic regression. It is possible, however, to have overdispersion in a binomial logistic regression where $Y_i \sim \text{binomial}(N, p_i)$ and $N>1$.
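As a rough check of the paragraph above, the sketch below recomputes the Pearson residuals by hand from the Bernoulli mean and variance and compares them with `residuals(M, type = "pearson")`. The data are simulated (seed, sample size, and success probability are arbitrary assumptions, not from the question). It also evaluates, for each observation, a classical $(O - E)^2/E$ sum over the two cells $y = 1$ and $y = 0$ with expected values $p_i$ and $1 - p_i$; algebraically this reduces to $(y_i - p_i)^2/(p_i(1-p_i))$, which may help reconcile the two formulas in the question.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 200
d <- data.frame(x = 1:n, y = rbinom(n, 1, 0.3))
M <- glm(y ~ x, family = binomial, data = d)
p <- fitted(M)                    # estimated P(Y = 1) for each observation

# Pearson residuals as returned by R, and the same quantity by hand:
# (observed - mean) / sqrt(variance), with Bernoulli variance p * (1 - p).
r_builtin <- residuals(M, type = "pearson")
r_manual  <- (d$y - p) / sqrt(p * (1 - p))
all.equal(unname(r_builtin), unname(r_manual))

# (O - E)^2 / E summed over both cells of each observation:
# cell "1" has observed y and expected p, cell "0" has 1 - y and 1 - p.
two_cell <- (d$y - p)^2 / p + ((1 - d$y) - (1 - p))^2 / (1 - p)
all.equal(sum(two_cell), sum(r_builtin^2))
```

Both comparisons should agree up to floating point error. Dividing by $p_i$ alone, as in the first formula of the question, leaves out the contribution of the $y = 0$ cell.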

answered May 07 '14 at 22:12

jsk


Thanks for the answer. I'm not certain I understand though... – Wilmerton May 08 '14 at 06:58
First, the $\chi^2$ test statistic formula I gave is quoted in many places. It is used in [this book](https://www.otexts.org/node/645), for instance, for a [goodness-of-fit](https://www.otexts.org/node/646) test. Second, I just now found [a book](http://statweb.stanford.edu/~tibs/ElemStatLearn/) which quotes (p.144) the second formula in my question for the "Pearson chi-square statistic". It's nice that you provide a clear explanation of why the second formula provides a $\chi^2$ independent of $p_i$. +1 if I had the reputation... – Wilmerton May 08 '14 at 07:43
So, here is a reformulation of my question: why are the Pearson residuals defined differently? Is it context dependent? – Wilmerton May 08 '14 at 07:44
Take a look at the formula here: https://www.otexts.org/node/645. Note that the terms are normally $O_i$ and $E_i$, and almost all of the expected counts $E_i$ have to be larger than 5. See the rules in section 5.3. The Pearson residuals are used in generalized linear models (GLMs) and depend on the model chosen for the data (binomial, Poisson, negative binomial, gamma, etc., each of which has a different mean-variance relationship and thus a different denominator). They're defined differently because they are being used differently. – jsk May 08 '14 at 16:07
@Wilmerton The goal of creating Pearson residuals is to create standardized residuals that should be mean zero with variance of 1 if the model specified is true. I hope that I have been able to clear up some of your confusion. – jsk May 09 '14 at 03:10
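
To spell out the last comment for the Bernoulli case, here is a short worked calculation. It assumes the fitted $p_i$ equals the true success probability (in practice it is only an estimate, so the result holds approximately), and the symbol $r_i$ for the Pearson residual is notation introduced here, not in the thread:

$$r_i = \frac{Y_i - p_i}{\sqrt{p_i(1-p_i)}}, \qquad E(r_i) = \frac{E(Y_i) - p_i}{\sqrt{p_i(1-p_i)}} = 0, \qquad \operatorname{Var}(r_i) = \frac{\operatorname{Var}(Y_i)}{p_i(1-p_i)} = \frac{p_i(1-p_i)}{p_i(1-p_i)} = 1$$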
