I suspect this question is really about basic definition, but I could not find the ressource I need to solve my problem.
I want to understand why the pearson $\chi^2$ test statistic, and corresponding residuals, are computed the way they are in R.
First, some tests:
>d<-data.frame(x=1:10000,y=sample(c(rep(1,100),0),10000,replace=T))
>M<-glm(y~x,family="binomial",data=d)
>d$p<-predict(M,type="response")
>chisq.test(table(d$p,d$y))
Pearson's Chi-squared test
data: table(d$p, d$y)
X-squared = 10000, df = 9999, p-value = 0.4953
Warning message:
In chisq.test(table(d$p, d$y)) :
l'approximation du Chi-2 est peut-être incorrecte
Ok, now an alternative that gives consistent results computed as in this answer
>sum(residuals(M,type="pearson")^2)
[1] 10000.75
Considering the formula
$$\chi^2 = \sum_{i \in \text{observations}} (O_i - P_i)^2/P_i$$
where the $O_i$ are the observed values and $P_i$ are the probabilities given by the model, I would have, perhaps naïvely, calculated
>sum((d$y-d$p)^2/d$p)
[1] 14
which provides another result. This is because the résiduals are different:
>head(residuals(M,type="pearson")^2)
1 2 3 4 5 6
0.001286902 0.001286924 0.001286946 0.001286968 0.001286989 0.001287011
>head((d$y-d$p)^2/d$p)
[1] 1.653989e-06 1.654045e-06 1.654101e-06 1.654157e-06 1.654213e-06 1.654269e-06
Going further, one discovers that the formula for the residuals is actually (see below for code)
$$\chi^2 = \sum_{i \in \text{observations}} (O_i - P_i)^2/(P_i(1-P_i))$$
which is much nicer (to my opinion) as it results in a constant $\chi^2$ as a function of $P_i$. But where does that come from?
Thanks!
----- how I found the last formula:
> getAnywhere(residuals.glm)
A single object matching ‘residuals.glm’ was found
It was found in the following places
package:stats
registered S3 method for residuals from namespace stats
namespace:stats
with value
function (object, type = c("deviance", "pearson", "working",
"response", "partial"), ...)
{
type <- match.arg(type)
y <- object$y
r <- object$residuals
mu <- object$fitted.values
wts <- object$prior.weights
switch(type, deviance = , pearson = , response = if (is.null(y)) {
mu.eta <- object$family$mu.eta
eta <- object$linear.predictors
y <- mu + r * mu.eta(eta)
})
res <- switch(type, deviance = if (object$df.residual > 0) {
d.res <- sqrt(pmax((object$family$dev.resids)(y, mu,
wts), 0))
ifelse(y > mu, d.res, -d.res)
} else rep.int(0, length(mu)), pearson = (y - mu) * sqrt(wts)/sqrt(object$family$variance(mu)),
working = r, response = y - mu, partial = r)
if (!is.null(object$na.action))
res <- naresid(object$na.action, res)
if (type == "partial")
res <- res + predict(object, type = "terms")
res
}
<bytecode: 0x000000000a27cfc0>
<environment: namespace:stats>
> M$family$variance
function (mu)
mu * (1 - mu)
<bytecode: 0x000000000a191810>
<environment: 0x0000000008b31710>