Logistic regression BIC: what's the right N?

Question

TL;DR: Which $N$ is correct for BIC in logistic regression, the aggregated binomial or Bernoulli $N$?

UPDATES AT BOTTOM

Suppose I have a data set to which I'd like to apply logistic regression. For the sake of example, suppose there are $j=5$ groups with $m=100$ participants each, for a total of $n=500$. The outcome is 0 or 1, as in the following data set (R code):

library(dplyr)
library(tidyr)


set.seed(45)
d <- tibble(y = rbinom(500, 1, .5),
            x = factor(rep(LETTERS[1:5], each = 100)))

There are two ways I can represent this: as is, treating every observation as a Bernoulli random variable, or aggregated within groups, treating each group's success count as Binomial. The data set has 500 rows in the first case and 5 in the second.

I can construct the aggregated data set:

# Count failures (f: y == 0) and successes (s: y == 1) per group, one row per group
d %>% 
  group_by(x, y) %>% 
  summarise(n = n()) %>%
  spread(y, n) %>%
  rename(f = `0`, s = `1`) %>%
  mutate(n = s + f) -> d_agg

I can then fit the logistic regression using both data sets in R:

g_bern  <- glm(y ~ x,          data=d,     family=binomial)
g_binom <- glm(cbind(s,f) ~ x, data=d_agg, family=binomial)

UPDATE 2: We now fit the intercept-only models:

g_bern0  <- glm(y ~ 1,          data=d,     family=binomial)
g_binom0 <- glm(cbind(s,f) ~ 1, data=d_agg, family=binomial)

and compute the AIC:

> AIC(g_bern)  
# [1] 694.6011
> AIC(g_binom)  
# [1] 35.22172

These differ by a constant, twice the sum of the log binomial coefficients:

2*sum(lchoose(d_agg$n, d_agg$s))  # [1] 659.3794

as expected (see: Logistic Regression: Bernoulli vs. Binomial Response Variables).
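
To spell out where that constant comes from, here is a short derivation. The symbols are introduced only for illustration: $s_g$ and $n_g$ are the successes and trials in group $g$, $p_g$ is the fitted probability for that group, and $\ell$ is a maximized log-likelihood.

$$
\ell_{\text{binom}}
= \sum_{g} \left[ \log\binom{n_g}{s_g} + s_g \log p_g + (n_g - s_g)\log(1 - p_g) \right]
= \ell_{\text{bern}} + \sum_{g} \log\binom{n_g}{s_g}
$$

Both fits have the same number of parameters, so the $2k$ penalty cancels and

$$
\mathrm{AIC}_{\text{bern}} - \mathrm{AIC}_{\text{binom}}
= 2\sum_{g} \log\binom{n_g}{s_g}
= 694.6011 - 35.22172 \approx 659.3794 .
$$

The binomial coefficients depend only on the data, not on the model, so the same shift applies to every model fitted to the aggregated response.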

However, the BICs differ by that constant AND by a penalty term that depends on the "number of observations", and the number of observations differs between the two fits:

> BIC(g_bern)    
# [1] 715.6742
> BIC(g_binom)  
# [1] 33.26891
> nobs(g_bern)   
# [1] 500
> nobs(g_binom)  
# [1] 5

Just to confirm, we can recalculate BIC for both:

> -2*logLik(g_bern) + attr(logLik(g_bern),"df")*log(nobs(g_bern))
# 'log Lik.' 715.6742 (df=5)
> -2*logLik(g_binom) + attr(logLik(g_binom),"df")*log(nobs(g_binom))
# 'log Lik.' 33.26891 (df=5)

and indeed, apart from the constant above, the only place these two calculations differ is $N$.
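
One way to see this concretely is to recompute the BIC with $N$ supplied by hand. This is a sketch, not a statement about which $N$ is right. The helper `bic_with_n` is hypothetical (it is not part of R), and the data are rebuilt with base R only so that the block runs on its own. Once both fits are given the same $N$, their BICs differ only by the binomial-coefficient constant.

```r
# Rebuild the toy data from the question (base R only).
set.seed(45)
d <- data.frame(y = rbinom(500, 1, .5),
                x = factor(rep(LETTERS[1:5], each = 100)))

# Aggregate to one row per group: f = failures, s = successes, n = trials.
tab <- table(d$x, d$y)
d_agg <- data.frame(x = factor(rownames(tab)),
                    f = as.vector(tab[, "0"]),
                    s = as.vector(tab[, "1"]))
d_agg$n <- d_agg$s + d_agg$f

g_bern  <- glm(y ~ x,           data = d,     family = binomial)
g_binom <- glm(cbind(s, f) ~ x, data = d_agg, family = binomial)

# Hypothetical helper: BIC with the sample size passed explicitly
# instead of being taken from nobs(fit).
bic_with_n <- function(fit, n) {
  ll <- logLik(fit)
  as.numeric(-2 * ll + attr(ll, "df") * log(n))
}

# Constant shift between the two likelihoods.
const <- 2 * sum(lchoose(d_agg$n, d_agg$s))

# Same N in both fits (here the 500 trials): the gap is just the constant.
gap_trials <- bic_with_n(g_bern, 500) - bic_with_n(g_binom, sum(d_agg$n))
all.equal(gap_trials, const)

# Same N in both fits (here the 5 rows): again just the constant.
gap_rows <- bic_with_n(g_bern, 5) - bic_with_n(g_binom, nobs(g_binom))
all.equal(gap_rows, const)
```

The design choice here is only to make $N$ an explicit argument, which isolates the $k \log N$ term from the likelihood and leaves the choice of $N$ to the analyst.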

UPDATE 2: When we try to assess the factor x, we see a disagreement that is ONLY attributable to the number of observations:

> BIC(g_bern0) - BIC(g_bern)
# [1] -17.66498
> BIC(g_binom0) - BIC(g_binom)
# [1] 0.7556999

UPDATE 2: As expected, the AICs are consistent:

> AIC(g_bern0) - AIC(g_bern)
# [1] -0.8065485
> AIC(g_binom0) - AIC(g_binom)
# [1] -0.8065485
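
The arithmetic behind these differences can be checked by hand from the numbers above. The symbol $\Delta D$ is introduced here for illustration: it is the drop in $-2\log L$ from the intercept-only model ($k=1$) to the full model ($k=5$). The binomial-coefficient constant cancels in it, so it is the same under both formats.

$$
\mathrm{AIC}_0 - \mathrm{AIC}_1 = \Delta D - 2(5-1) = -0.8065485
\quad\Longrightarrow\quad
\Delta D = 7.1934515
$$

$$
\mathrm{BIC}_0 - \mathrm{BIC}_1 = \Delta D - (5-1)\log N =
\begin{cases}
7.1934515 - 4\log 500 \approx 7.1934515 - 24.8584 = -17.6650, & N = 500 \\
7.1934515 - 4\log 5 \approx 7.1934515 - 6.4378 = 0.7557, & N = 5
\end{cases}
$$

The whole disagreement is therefore the $4\log N$ term: about 24.86 for $N=500$ versus about 6.44 for $N=5$.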

This surprises me, since I would think that R would "know" which of the two to use to prevent ambiguity. It has the same information in both cases.

Which one is "right"? Or is BIC really this arbitrary?

UPDATE: I am not trying to compare the Bernoulli to the Binomial model. This is just a toy example. I have a set of comparisons where it matters which setup I use, because the penalties for $N$ are different. I have two sets of model comparisons and the winning model changes based on the $N$ penalty, even though these appear to me to be the same sets of models.

UPDATES 2 and 3: Added the comparisons to the intercept-only model and changed the random seed to get a sign difference in the BIC.

score 6 · Answer 1 · answered Mar 09 '19 at 18:20

The BIC (and the AIC) are relative measures for comparing models. However, it makes no sense to compare what is otherwise the same model between using an aggregated vs. a disaggregated response. Nor would it make sense to compare models that would otherwise be different (e.g., different regressors), but where one model uses an aggregated response and the other model uses a disaggregated version of the response. As long as the two models being compared both represent the response variable in the same format, everything will be fine. Note that the two formats are ultimately equivalent—they contain the same information and mostly just look different on the outside, see: Input format for response in binomial glm in R.

gung - Reinstate Monica

Just to clarify: I am not trying to compare g_bern to g_binom. This is a toy example. I have an application in which this penalty matters for which model wins. So everything is not fine in the sense that they are not equivalent. That is the point. (see my update) – Salad dressing Mar 10 '19 at 08:56
Then you need to describe the actual situation you are facing, not some toy example that differs. [XY questions](https://en.wikipedia.org/wiki/XY_problem) often lead to answers that don't address what the asker really needs to know. – gung - Reinstate Monica Mar 10 '19 at 13:47
I added an update showing that the comparisons with the intercept-only model are inconsistent between the approaches due to the number of observations being different. My question is which of the two seemingly equivalent approaches is "correct". – Salad dressing Mar 10 '19 at 21:24
@Saladdressing, I don't follow your update. Where is `g_bern0`, eg, defined? If the idea is that it was supposed to be an intercept-only model, I don't see a problem. The actual numbers of the BIC don't matter. It's just a relative measure. In both cases (ie, `g_bern0`, & `g_binom0`), the difference is negative, implying the same preference (for the intercept-only model). There is no inconsistency. – gung - Reinstate Monica Mar 11 '19 at 02:01
The difference in BIC between the two comparisons was an order of magnitude, which is huge on a logarithmic value such as BIC. And these are supposed to be the same models. But anyway, I incremented the seed. Now there is a sign inconsistency between the two BIC comparisons. I cannot see how "the two formats are ultimately equivalent—they contain the same information and mostly just look different on the outside" could possibly be true given that one BIC is -18, and the other is 1. This is a *staggering* difference. – Salad dressing Mar 11 '19 at 14:43

Isabella Ghement · Answer 2 · 2019-03-09T19:04:45.647

Interesting question! Coming at this from an applied setting, I think you need to remember that both BIC and AIC are measures of relative model fit.

In other words, these measures don't tell you much when you examine them for a single model, but can help you to select an appropriate model among a set of competing models. In particular:

If your goal is to find the 'best' among those competing models for prediction of the outcome variable, then select the model with the lowest AIC value;
If your goal is to find the 'best' among those competing models for understanding and describing the effects of the predictor variables included in the model on the outcome variable, then select the model with the lowest BIC value.

In defining your set of competing models, you would have to make sure the models follow the same conceptual framework. Thus, you would either compare several binomial logistic regression models or several binary logistic models, but not a mixture of both. (It is important to compare like with like, otherwise you won't know if a model won the competition based on its own merits or simply because you changed the model specification/fitting procedure.)

From this perspective, the only thing that matters is that R is consistent when computing the AIC and BIC across models of the same type (e.g., binomial logistic regression models).

Just to clarify: g_bern is a binary logistic regression model, whereas g_binom is a binomial logistic regression model. While they both model the probability of success in one trial, you wouldn't mix together variations of these models when defining your set of competing models (for the reasons explained above and also covered by @gung).

Please see my update. Just to clarify: I am not trying to compare g_bern to g_binom. This is a toy example. — Salad dressing, Mar 10 '19 at 08:57
