11

The likelihood ratio (a.k.a. deviance) $G^2$ statistic and its lack-of-fit (or goodness-of-fit) test are fairly straightforward to obtain for a logistic regression model (fit using the glm(..., family = binomial) function) in R. However, some cell counts can easily end up low enough that the test is unreliable. One way to verify the reliability of the likelihood ratio test for lack of fit is to compare its test statistic and P-value to those of Pearson's chi-square (or $\chi^2$) lack-of-fit test.

Neither the glm object nor its summary() method reports the test statistic for Pearson's chi-square test for lack of fit. In my search, the only thing I came up with is the chisq.test() function (in the stats package): its documentation says "chisq.test performs chi-squared contingency table tests and goodness-of-fit tests." However, the documentation is sparse on how to perform such tests:

If x is a matrix with one row or column, or if x is a vector and y is not given, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table). The entries of x must be non-negative integers. In this case, the hypothesis tested is whether the population probabilities equal those in p, or are all equal if p is not given.

I'd imagine that you could use the y component of the glm object for the x argument of chisq.test. However, you can't use the fitted.values component of the glm object for the p argument of chisq.test, because you'll get an error: "probabilities must sum to 1."
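
For example, a minimal sketch of the failed attempt (the model and data names here are hypothetical):

fit <- glm(y ~ x, family = binomial)          # hypothetical logistic fit
chisq.test(x = fit$y, p = fit$fitted.values)  # Error: probabilities must sum to 1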

How can I (in R) at least calculate the Pearson $\chi^2$ test statistic for lack of fit without having to run through the steps manually?

Macro
Firefeather

4 Answers

15

The sum of the squared Pearson residuals is exactly equal to the Pearson $\chi^2$ test statistic for lack of fit. So if your fitted model (i.e., the glm object) is called logistic.fit, the following code would return the test statistic:

sum(residuals(logistic.fit, type = "pearson")^2)

See the documentation on residuals.glm for more information, including what other residuals are available. For example, the code

sum(residuals(logistic.fit, type = "deviance")^2)

will give you the $G^2$ test statistic, the same value that deviance(logistic.fit) returns.
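
To turn either statistic into a P-value, compare it to a $\chi^2$ distribution on the model's residual degrees of freedom. A minimal sketch, reusing logistic.fit from above:

pearson.stat <- sum(residuals(logistic.fit, type = "pearson")^2)
pchisq(pearson.stat, df = df.residual(logistic.fit), lower.tail = FALSE)            # Pearson P-value
pchisq(deviance(logistic.fit), df = df.residual(logistic.fit), lower.tail = FALSE)  # G^2 P-value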

gung - Reinstate Monica
Firefeather
  • I agree with Macro. If you wanted answers from the group you should have waited to hear what others have to say first. Anything you might get now is biased by seeing your answer. Also, if you know the answer, what are you trying to prove by doing this? – Michael R. Chernick Jun 01 '12 at 19:21
  • @Macro - Firefeather has posted four questions to this site (including this one) and answered three of them (including this one) him/herself, accepting one of his/her own answers once. A few more like this and I might start seeing a pattern! – jbowman Jun 01 '12 at 20:42
  • @jbowman, I can imagine answering your own question if you figured it out on your own before anyone else posted an answer, but firefeather posted an answer literally less than 5 minutes after posting the question; it seems he/she was not in need of help, which is what made me ask why... I don't really understand the motivation... – Macro Jun 01 '12 at 22:31
  • @macro - I agree with you, and I really did mean to point out that this seems to be a pattern for the OP, although I guess that wasn't clear (-1 to me :) – jbowman Jun 02 '12 at 00:42
  • No -1 @jbowman! I got what you were saying - I was just making an additional remark – Macro Jun 02 '12 at 02:22
  • @Macro, please see this official link: http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/ (it's linked on the Ask a Question page in the checkbox label at the bottom: "Answer your own question – share your knowledge, Q&A-style"). I had this question while I was doing homework (having chosen to use R instead of Minitab, though Minitab was demonstrated in class), but I didn't have enough time to type up the question and wait for a response. I figured out this workaround, and decided to share it with the community. – Firefeather Jun 04 '12 at 14:30
  • Cool @Firefeather! I knew there was a role for answering your own question but I didn't know it was encouraged to post questions you already knew the answer to. It makes sense - thanks for the info. – Macro Jun 04 '12 at 14:35
  • @Macro, you're very welcome! I wish I could ask more questions where I don't provide the answer, and answer more questions that I didn't ask. But [jbowman](http://stats.stackexchange.com/users/7555/jbowman)'s right about a pattern: my contributions to the community **are** tending toward talking to myself. :) (At least I'm contributing somehow to the community, right?) – Firefeather Jun 04 '12 at 14:43
  • @Firefeather - definitely!! I just found your question because this is exactly what I needed. Thanks! – Tomas Oct 21 '12 at 16:07
11

The Pearson statistic has a degenerate distribution, so it is not recommended in general for logistic model goodness of fit. I prefer structured tests (linearity, additivity). If you want an omnibus test, see the single-degree-of-freedom le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test, as implemented in the residuals.lrm function of the R rms package.
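
For example, a sketch assuming a data frame d with binary response y and predictors x1 and x2 (lrm must be called with x = TRUE, y = TRUE so these residuals can be computed):

library(rms)
f <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)  # hypothetical model
resid(f, "gof")  # le Cessie - van Houwelingen - Copas - Hosmer test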

Frank Harrell
  • -1: Thank you for the insight! However, this does not answer my question. Because it's relevant commentary/discussion on a statement I made in the background to my question, your answer probably belongs in a comment, instead of in an answer. – Firefeather Mar 18 '14 at 12:47
  • I guess that the four people who voted for my answer disagree with you. And you haven't dealt with the degenerate distribution. – Frank Harrell Mar 18 '14 at 15:45
  • @FrankHarrell: Is this GOF measure different than the Hosmer-Lemeshow (H-L) GOF test? Assuming so because of the name, and have also compared the two: conducted the H-L GOF test as found in the `ResourceSelection` package, and its result is different than what I get from running `resid(lrm_object, 'gof')` after fitting my logistic regression model as `lrm_object`. – Meg Nov 18 '15 at 01:05
  • The two are vastly different. The HL statistic (now obsolete) has fixed d.f. and is usually based on deciles of risk. The HL $\chi^2$ statistic thus is not degenerate as $N\rightarrow\infty$. On the other hand, beware of any $\chi^2$ statistic where the d.f. keeps expanding with $N$. – Frank Harrell Nov 18 '15 at 12:51
  • I would love to see a simulation that shows this degeneration. – wdkrnls Jul 03 '18 at 02:39
  • This is a well-known result in mathematical statistics. Wish I had the reference. There are much better ways to look at GOF. – Frank Harrell Jul 03 '18 at 15:35
0

You can also use c_hat(mod) (from the AICcmodavg package), which will give the same output as sum(residuals(mod, type = "pearson")^2).
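
A minimal sketch of this route; I believe c_hat() lives in the AICcmodavg package, and mod is a hypothetical fitted binomial glm:

library(AICcmodavg)                      # assumed home of c_hat()
c_hat(mod)                               # compare against:
sum(residuals(mod, type = "pearson")^2)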

0

Thanks, I didn't realize it was as simple as sum(residuals(f1, type = "pearson")^2). However, please note that the Pearson residual varies depending on whether it is calculated by covariate group or by individual. A simple example:

m1 is a matrix (this one is the head of a larger matrix):

m1[1:4,1:8]

     x1 x2 x3 obs    pi   lev  yhat y
obs1  1  1 44   5 0.359 0.131 1.795 2
obs2  0  1 43  27 0.176 0.053 4.752 4
obs3  0  1 53  15 0.219 0.062 3.285 1
obs4  0  1 33  22 0.140 0.069 3.080 3

where x1-x3 are predictors, obs is the number of observations in each group, pi is the predicted probability of group membership (from the regression equation), lev is the leverage (the diagonal of the hat matrix), yhat is the predicted number of y = 1 in the group, and y is the actual number.

This will give you Pearson's residuals by group. Note how the calculation differs when y == 0 for a covariate pattern:

fun1 <- function(j) {
  if (m1[j, "y"] == 0) {
    # y = 0 for this covariate pattern
    Pr1 <- sqrt(m1[j, "pi"] / (1 - m1[j, "pi"]))
    Pr2 <- -sqrt(m1[j, "obs"])
    round(Pr1 * Pr2, 3)
  } else {
    Pr1 <- m1[j, "y"] - m1[j, "yhat"]
    Pr2 <- sqrt(m1[j, "yhat"] * (1 - m1[j, "pi"]))
    round(Pr1 / Pr2, 3)
  }
}

Thus

nr   <- nrow(m1)
nr1  <- seq(1, nr)
Pr   <- sapply(nr1, FUN = fun1)
PrSj <- sum(Pr^2)

If there are large numbers of subjects with y = 0 covariate patterns, then the Pearson residual will be much larger when calculated using the 'by group' rather than the 'by individual' method.
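
To see the difference, a sketch with hypothetical data: d holds one row per individual, while d.grouped holds one row per covariate pattern with success and failure counts:

fit.ind <- glm(y ~ x1 + x2, family = binomial, data = d)
sum(residuals(fit.ind, type = "pearson")^2)   # 'by individual'

fit.grp <- glm(cbind(successes, failures) ~ x1 + x2, family = binomial, data = d.grouped)
sum(residuals(fit.grp, type = "pearson")^2)   # 'by covariate group'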

See e.g. Hosmer & Lemeshow, "Applied Logistic Regression" (Wiley, 2000).

chl
dardisco