
I have the following result from running glm function.

How can I interpret the following values:

  • Null deviance
  • Residual deviance
  • AIC

Do they have something to do with the goodness of fit? Can I calculate some goodness of fit measure from these result such as R-square or any other measure?

Call:
glm(formula = tmpData$Y ~ tmpData$X1 + tmpData$X2 + tmpData$X3 + 
    as.numeric(tmpData$X4) + tmpData$X5 + tmpData$X6 + tmpData$X7)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.52628  -0.24781  -0.02916   0.25581   0.48509  

Coefficients:
                     Estimate Std. Error  t value Pr(>|t|)    
(Intercept)        -1.305e-01  1.391e-01   -0.938   0.3482    
tmpData$X1         -9.999e-01  1.059e-03 -944.580   <2e-16 ***
tmpData$X2         -1.001e+00  1.104e-03 -906.787   <2e-16 ***
tmpData$X3         -5.500e-03  3.220e-03   -1.708   0.0877 .  
tmpData$X4         -1.825e-05  2.716e-05   -0.672   0.5017    
tmpData$X5          1.000e+00  5.904e-03  169.423   <2e-16 ***
tmpData$X6          1.002e+00  1.452e-03  690.211   <2e-16 ***
tmpData$X7          6.128e-04  3.035e-04    2.019   0.0436 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 0.08496843)

    Null deviance: 109217.71  on 3006  degrees of freedom
Residual deviance:    254.82  on 2999  degrees of freedom
  (4970 observations deleted due to missingness)
AIC: 1129.8

Number of Fisher Scoring iterations: 2
learner
  • I realize this was migrated from SO, where one would not normally look for information on these statistical terms. You have a great resource here! For example, see what you can learn from a search on some of your terms, like [AIC](http://stats.stackexchange.com/search?tab=votes&q=aic). A little time spent doing this should either fully answer your question or at least guide you to asking a more specific one. – whuber Dec 20 '12 at 23:56
  • Not related to gaussian GLMs, but if you have a Bernoulli GLM fitted to binary data, you cannot use the residual deviance to assess the model fit, because it turns out the data cancels out in the deviance formula. You can, however, use the *difference* of residual deviances in that case to compare two models, but not the residual deviance itself (a sketch of such a comparison follows these comments). – FisherDisinformation Jul 14 '16 at 19:53
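
Following up on the last comment, a minimal sketch (simulated data, illustrative names) of comparing two nested binomial GLMs through the *difference* of their residual deviances, rather than through either deviance on its own:

set.seed(1)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(x1))             # binary outcome

fit_small <- glm(y ~ x1,      family = binomial)
fit_big   <- glm(y ~ x1 + x2, family = binomial)

anova(fit_small, fit_big, test = "Chisq")    # likelihood-ratio (deviance) test of the added term
# the test statistic is deviance(fit_small) - deviance(fit_big)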

3 Answers


Use the Null Deviance and the Residual Deviance, specifically:

1 - (Residual Deviance/Null Deviance)

If you think about it, you're measuring how much deviance is left in your model (the residual deviance) relative to an intercept-only model (the null deviance). If that ratio is tiny, your model has 'explained' most of the deviance in the null model; one minus the ratio gives you your R-squared.

In your instance you'd get .998.
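
Plugging the deviances printed in the summary into that ratio:

1 - 254.82/109217.71
# 0.9976669, i.e. roughly .998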

If you fit the same model with lm() instead of glm(), the summary reports an R-squared explicitly, and you can see it is the same number.
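
For example, a sketch assuming tmpData with the columns from the question:

fit_lm <- lm(tmpData$Y ~ tmpData$X1 + tmpData$X2 + tmpData$X3 +
             as.numeric(tmpData$X4) + tmpData$X5 + tmpData$X6 + tmpData$X7)
summary(fit_lm)$r.squared   # Multiple R-squared, equal to 1 - (residual deviance / null deviance) from the glm fit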

With the standard glm object in R, you can calculate this as:

reg = glm(...)
with(summary(reg), 1 - deviance/null.deviance)
noLongerRandom
    Does this pseudo-R have a specific name? I can't find it in the literature. – Giulia Martini Nov 02 '20 at 09:46
  • @GiuliaMartini, I believe this is McFadden's pseudo r-squared – Forrest Apr 13 '21 at 00:58
  • So is a larger value or lower value preferred? – Ken Lee May 28 '21 at 16:08
  • @KenLee When the Residual Deviance is small relative to the Null, then the ratio will also be small and that proposed GOF measure would be closer to 1 (and "preferred"). There is no adjustment of that measure for model complexity. – DWin Nov 23 '21 at 01:05

The default error family for glm in (the language) R is gaussian, so the code submitted fits an ordinary linear regression, where $R^2$ is a widely accepted measure of "goodness of fit". R's glm function doesn't report a Nagelkerke pseudo-$R^2$ but rather the AIC (Akaike Information Criterion). In the case of an OLS model, the Nagelkerke GOF measure will be roughly the same as the $R^2$:

$$R^2_{\mathrm{GLM}}=1-\frac{\left(\sum_i d_{i,\mathrm{model}}^2\right)^{2/N}}{\left(\sum_i d_{i,\mathrm{null}}^2\right)^{2/N}} \;\approx\; 1-\frac{\mathit{SSE}/n\,[\mathrm{model}]}{\mathit{SST}/n\,[\mathrm{total}]} = R^2_{\mathrm{OLS}}$$

There is some debate about how a measure like the one on the LHS should be interpreted, but only when the model departs from the simpler Gaussian/OLS situation. In GLMs the link function may not be the identity (here it is), and the "squared error" then no longer has the same clear interpretation, so the Akaike Information Criterion is also reported because it is more general. There are several other contenders in the GLM GOF sweepstakes, with no clear winner.

You might want to consider not reporting a GOF measure if you are going to be using GLMs with other error structures: Which pseudo-$R^2$ measure is the one to report for logistic regression (Cox & Snell or Nagelkerke)?
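
For reference, a minimal sketch (simulated data, illustrative names) of where these quantities live in a fitted glm object:

set.seed(2)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- glm(y ~ x)       # family = gaussian is the default
fit$null.deviance       # the "Null deviance" line of the printed summary
deviance(fit)           # the "Residual deviance" line
AIC(fit)                # the reported AIC
with(summary(fit), 1 - deviance/null.deviance)   # the ratio from the other answer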

DWin
  • Where exactly is the "Nagelkerke-pseudo-R2" in the above output? – Tom Sep 25 '13 at 02:57
  • I'm echoing Tom's question. Where in the output is the Nagelkerke-pseudo-R2, or how do I find it? I'm not looking for more information about the value, but rather where I can find it in R's output. There's nothing in the question's example output that looks to me like a goodness of fit value in the range [0-1], so I'm confused. – Kevin Apr 29 '15 at 00:38
  • See http://stats.stackexchange.com/questions/8511/how-to-calculate-pseudo-r2-from-rs-logistic-regression and http://stackoverflow.com/questions/6242818/generalized-r-squared-naglekerkes-r2 ... I don't see any R^2 in either the glm object or the summary output. I may have been thinking of the usual output from rms summary functions, since that is my favorite modeling environment. – DWin Apr 29 '15 at 00:54

If you are running a binary logistic model, you can also run the Hosmer-Lemeshow goodness-of-fit test on your glm() model, using the ResourceSelection library.

library(ResourceSelection)

# Note: this requires a binary outcome; the model in the question is gaussian.
model <- glm(Y ~ X1 + X2 + X3 + as.numeric(X4) + X5 + X6 + X7,
             data = tmpData, family = binomial)

summary(model)
hoslem.test(model$y, fitted(model))
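
hoslem.test() returns a chi-squared statistic and a p-value: a small p-value indicates lack of fit (observed and expected event counts disagree across the groups of fitted risk), while a large p-value is consistent with adequate calibration.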
dylanjm
  • Though note that this only works for binary dependent variable models (e.g. if OP had set `family = "binomial"`). OP's example is linear regression. – Matthew Jul 14 '16 at 17:29
  • @Matthew This is true, I'm sorry I missed that. I've been using binary logistic regressions so much lately my brain just went to the `hoslem.test()` – dylanjm Jul 14 '16 at 17:42
  • Understandable :) I suggested an edit to your post but forgot to update the R code as well. You may want to change that just for clarity's sake. – Matthew Jul 14 '16 at 17:43