
I encountered a strange phenomenon when calculating pseudo R2 for logistic models fitted to aggregated data: the results are simply too good to be true. An example (although, as far as I can see, every aggregated file shows the same problem):

 library(pscl)
 cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
               header=TRUE)

 head(cuse)
 cuse.fit <- glm( cbind(using, notUsing) ~ age + education + wantsMore, 
             family = binomial, data=cuse)

 summary(cuse.fit)
 pR2(cuse.fit)     

The results are:

> summary(cuse.fit)

Call:
glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, 
family = binomial, data = cuse)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5148  -0.9376   0.2408   0.9822   1.7333  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.8082     0.1590  -5.083 3.71e-07 ***
age25-29       0.3894     0.1759   2.214  0.02681 *  
age30-39       0.9086     0.1646   5.519 3.40e-08 ***
age40-49       1.1892     0.2144   5.546 2.92e-08 ***
educationlow  -0.3250     0.1240  -2.620  0.00879 ** 
wantsMoreyes  -0.8330     0.1175  -7.091 1.33e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 165.772  on 15  degrees of freedom
Residual deviance:  29.917  on 10  degrees of freedom
AIC: 113.43

Number of Fisher Scoring iterations: 4

> pR2(cuse.fit)
         llh      llhNull           G2     McFadden         r2ML 
 -50.7125647 -118.6401419  135.8551544    0.5725514    0.9997947 
       r2CU 
  0.9997950 

The last three values returned by the pscl function pR2 are McFadden's pseudo R-squared, the maximum likelihood (Cox & Snell) pseudo R-squared, and Cragg & Uhler's (Nagelkerke's) pseudo R-squared. The calculation itself seems flawless, but outcomes this close to 1 are too good to be true.
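
For concreteness, these three measures can be reproduced by hand from the log-likelihoods that pR2 reports (a quick arithmetic check of my own; llh and llhNull are copied from the output above, and n = 16 is the number of rows of the aggregated data frame):

 llh     <- -50.7125647    # log-likelihood of the fitted model (from pR2)
 llhNull <- -118.6401419   # log-likelihood of the intercept-only model (from pR2)
 n       <- 16             # rows of the aggregated cuse data frame

 1 - llh/llhNull                    # McFadden:    0.5726
 1 - exp(2*(llhNull - llh)/n)       # Cox & Snell: 0.9998
 (1 - exp(2*(llhNull - llh)/n)) /
   (1 - exp(2*llhNull/n))           # Nagelkerke:  0.9998

With n = 16, exp(2*(llhNull - llh)/n) is practically zero, so the last two measures end up right next to 1.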

Using a weights argument instead of cbind:

 cuse2 = rbind(cuse, cuse)                             # stack the table twice
 cuse2$using.contraceptive = 1
 cuse2$using.contraceptive[1:nrow(cuse)] = 0           # first copy coded 0, second copy coded 1
 cuse2$freq = cuse2$notUsing                           # second copy weighted by notUsing
 cuse2$freq[1:nrow(cuse)] = cuse2$using[1:nrow(cuse)]  # first copy weighted by using
 cuse.fit2 = glm(using.contraceptive ~ age + education + wantsMore,
                 weights = freq, family = binomial, data = cuse2)
 summary(cuse.fit2)
 round(pR2(cuse.fit2), 5)

produces different logistic regression coefficients (note that in cuse2 the rows coded 1 carry the notUsing counts, so the outcome is effectively reversed and the coefficients flip sign), slightly different pseudo R2's for r2ML and r2CU, and a very different McFadden R2:

> round(pR2(cuse.fit2),5)
         llh     llhNull          G2    McFadden        r2ML 
  -933.91920 -1001.84677   135.85515     0.06780     0.98567 
        r2CU 
     0.98567 
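
The same hand calculation (again just a check on my side) shows what has changed: the log-likelihoods are now on the Bernoulli scale, but the n entering the Cox & Snell and Nagelkerke formulas is the 32 rows of cuse2 rather than the sum(cuse2$freq) = 1607 individual observations:

 llh     <- -933.91920
 llhNull <- -1001.84677

 1 - llh/llhNull                      # McFadden: 0.0678
 1 - exp(2*(llhNull - llh)/32)        # r2ML with n = 2*nrow(cuse):    0.9857
 1 - exp(2*(llhNull - llh)/1607)      # r2ML with n = sum(cuse2$freq): 0.0811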

Fully expanding the data to one row per observation results in very different estimates from pR2:

 cuse3 = rbind(cuse[rep(1:nrow(cuse), cuse[["notUsing"]]), ],
               cuse[rep(1:nrow(cuse), cuse[["using"]]), ])
 cuse3$using.contraceptive = 1
 cuse3$using.contraceptive[1:sum(cuse$notUsing)] = 0
 summary(cuse3)
 cuse.fit3 = glm(using.contraceptive ~ age + education + wantsMore,
                 family = binomial, data = cuse3)
 summary(cuse.fit3)
 round(pR2(cuse.fit3), 5)

 > round(pR2(cuse.fit3),5)
         llh     llhNull          G2    McFadden        r2ML 
  -933.91920 -1001.84677   135.85515     0.06780     0.08106 
        r2CU 
     0.11376 

This indicates a logistic model that explains very little, which is a little more believable than the near-perfect results from the aggregated files. Is there a more correct, and preferably more consistent, way to calculate the pseudo R2's?

tak101
  • I am a bit surprised by the lack of interest from the statistical community. Aggregated or tabulated data are regularly used for logistic regression, and an unsuspecting user may think he has a reasonable or even excellent model after calculating pseudo R2 indices on aggregated data, while in fact the model may be pretty useless. – tak101 Feb 02 '16 at 09:20
  • The cross validated community is rather small, so it might just have slipped by. – kjetil b halvorsen Feb 02 '16 at 10:33

3 Answers


Following my own suggestion, I think I have found the solution. It turns out that, to recompute the log-likelihoods, it is sufficient to expand the binary observations y and the fitted probabilities:

Pseudo.R2 <- function(object){
  stopifnot(object$family$family == "binomial")
  object0 <- update(object, ~ 1)          # intercept-only (null) model
  wt <- object$prior.weights              # binomial totals / frequency weights
  y  <- object$y                          # proportion of successes per row
  ones  <- round(y * wt)                  # successes per row
  zeros <- wt - ones                      # failures per row
  fv <- object$fitted.values
  if (is.null(object$na.action)) fv0 <- object0$fitted.values else
    fv0 <- object0$fitted.values[-object$na.action] # object may have missing values
  resp <- cbind(ones, zeros)
  # expand to one 0/1 observation per underlying case
  Y <- apply(resp, 1, function(x) c(rep(1, x[1]), rep(0, x[2])))
  if (is.list(Y)) Y <- unlist(Y) else Y <- c(Y)
  # expand the fitted probabilities of both models accordingly
  fv.exp <- apply(cbind(fv, wt), 1, function(x) rep(x[1], x[2]))
  if (is.list(fv.exp)) fv.exp <- unlist(fv.exp) else fv.exp <- c(fv.exp)
  fv0.exp <- apply(cbind(fv0, wt), 1, function(x) rep(x[1], x[2]))
  if (is.list(fv0.exp)) fv0.exp <- unlist(fv0.exp) else fv0.exp <- c(fv0.exp)
  ll  <- sum(log(dbinom(x = Y, size = 1, prob = fv.exp)))   # log-likelihood, fitted model
  ll0 <- sum(log(dbinom(x = Y, size = 1, prob = fv0.exp)))  # log-likelihood, null model

  n <- length(Y)                          # number of individual observations
  G2 <- -2 * (ll0 - ll)
  McFadden.R2 <- 1 - ll/ll0
  CoxSnell.R2 <- 1 - exp((2 * (ll0 - ll))/n)  # Cox & Snell / maximum likelihood pseudo R-squared
  r2ML.max <- 1 - exp(ll0 * 2/n)
  Nagelkerke.R2 <- CoxSnell.R2/r2ML.max       # Nagelkerke / Cragg & Uhler's pseudo R-squared

  c(llh = ll, llhNull = ll0, G2 = G2, McFadden = McFadden.R2,
    r2ML = CoxSnell.R2, r2CU = Nagelkerke.R2)
}
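
As an aside, the expansion is only a convenience: the same two log-likelihoods can be obtained without expanding, by weighting each row's Bernoulli contributions with its success and failure counts. A minimal sketch (the helper name ll.binary is mine, and it skips the na.action handling above):

ll.binary <- function(object) {
  wt <- object$prior.weights
  y  <- object$y
  ones  <- round(y * wt)           # successes per row
  zeros <- wt - ones               # failures per row
  fv <- object$fitted.values
  sum(ones * log(fv) + zeros * log(1 - fv))
}
# ll.binary(cuse.fit)               # -933.9192, the same for all three fits
# ll.binary(update(cuse.fit, ~ 1))  # -1001.847 (null model)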

Old results (using pR2 from the pscl package):

> round(pR2(cuse.fit),4)
     llh   llhNull        G2  McFadden      r2ML      r2CU 
-50.7126 -118.6401  135.8552    0.5726    0.9998    0.9998 
> round(pR2(cuse.fit2),4)
      llh    llhNull         G2   McFadden       r2ML       r2CU 
-933.9192 -1001.8468   135.8552     0.0678     0.9857     0.9857 
> round(pR2(cuse.fit3),4)
      llh    llhNull         G2   McFadden       r2ML       r2CU 
-933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 

New results, using Pseudo.R2:

> round(Pseudo.R2(cuse.fit),4)
      llh    llhNull         G2   McFadden       r2ML       r2CU 
-933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 
> round(Pseudo.R2(cuse.fit2),4)
      llh    llhNull         G2   McFadden       r2ML       r2CU 
-933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 
> round(Pseudo.R2(cuse.fit3),4)
      llh    llhNull         G2   McFadden       r2ML       r2CU 
-933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 

Now the results are consistent and no longer depend on the level of aggregation (tabulation). I have notified the maintainer of the pscl package; perhaps he is interested.
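
As a sanity check (assuming the cuse.fit3 object from the question is still in the workspace), the recomputed log-likelihoods agree with what logLik itself reports for the fully expanded fit. For the cbind fit, logLik differs only by the additive binomial-coefficient terms, which is exactly why the McFadden value from the aggregated fit came out so differently:

logLik(cuse.fit3)                 # -933.9192 (df=6), matches llh above
logLik(update(cuse.fit3, ~ 1))    # -1001.847, matches llhNull
logLik(cuse.fit)                  # -50.7126: same model, plus sum(log(choose(n_i, y_i))) terms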

tak101
  • By the way, for comparing the log-likelihoods of different models there is no need to expand the data, as long as the models have the same aggregation level. – tak101 Feb 02 '16 at 10:04

I found a partial solution. Studying the code of the pR2 function (see https://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function for how to view a function's source):

methods(pR2)
getAnywhere(pR2.glm)
getAnywhere(pR2Work)

I found an alternative way to get the correct number of observations, via object$prior.weights. Combining pR2.glm and pR2Work:

 my.pR2 <- function (object, ...) 
 {
     llh <- logLik(object)
     objectNull <- update(object, ~1)
     llhNull <- logLik(objectNull)
     # n <- dim(object$model)[1]
     n <- sum(object$prior.weights) # replacement of previous line
     McFadden <- 1 - llh/llhNull
     G2 <- -2 * (llhNull - llh)
     r2ML <- 1 - exp(-G2/n)
     r2ML.max <- 1 - exp(llhNull * 2/n)
     r2CU <- r2ML/r2ML.max
     out <- c(llh = llh, llhNull = llhNull, G2 = G2, McFadden = McFadden, 
       r2ML = r2ML, r2CU = r2CU)
     out
 }

The results are consistent for the weighted file and the unweighted file:

 > round(my.pR2(cuse.fit3),4) # unweighted
       llh    llhNull         G2   McFadden       r2ML       r2CU 
 -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 

 > round(my.pR2(cuse.fit2),4)
        llh    llhNull         G2   McFadden       r2ML       r2CU 
  -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138 

But the fit when using cbind with glm still gives problems:

 > round(my.pR2(cuse.fit),4)
       llh   llhNull        G2  McFadden      r2ML      r2CU 
  -50.7126 -118.6401  135.8552    0.5726    0.0811    0.5905 

Both McFadden and r2CU seem too high. The question seems to boil down to a weighted recalculation of llh and llhNull, but I have not figured out how to do that.

tak101

You may be interested in how to calculate it yourself, and be careful with out-of-sample pseudo-$R^2$, which can be an issue if you simply rely on packages.

Attached is my solution: https://stats.stackexchange.com/a/273208/128860

Xiaorui Zhu