TL;DR: Which $N$ is correct for BIC in logistic regression, the aggregated binomial or Bernoulli $N$?
UPDATES AT BOTTOM
Suppose I have a data set to which I'd like to apply logistic regression. For the sake of example, suppose there are $j=5$ groups with $m=100$ participants each, for a total $n=500$. The outcome is 0 or 1. For example, the following data set (R code):
library(dplyr)
library(tidyr)
set.seed(45)
d <- tibble(y = rbinom(500, 1, .5),
x = factor(rep(LETTERS[1:5], each = 100)))
There are two ways I can represent this: as is, above, treating every observation as a Bernoulli random variable, or aggregating observations within groups and treating each observation as Binomial. The number of rows in the data set will be 500 in the first instance, and 5 in the second.
I can construct the aggregated data set:
d %>%
group_by(x, y) %>%
summarise(n = n()) %>%
spread(y, n) %>%
rename(f = `0`, s = `1`) %>%
mutate(n = s + f) -> d_agg
I can then fit the logistic regression using both data sets in R:
g_bern <- glm(y ~ x, data=d, family=binomial)
g_binom <- glm(cbind(s,f) ~ x, data=d_agg, family=binomial)
UPDATE 2: We now fit the intercept only models:
g_bern0 <- glm(y ~ 1, data=d, family=binomial)
g_binom0 <- glm(cbind(s,f) ~ 1, data=d_agg, family=binomial)
and compute the AIC:
> AIC(g_bern)
# [1] 694.6011
> AIC(g_binom)
# [1] 35.22172
which of course differ by a constant
2*sum(lchoose(d_agg$n, d_agg$s)) # [1] 659.3794
as expected (see: Logistic Regression: Bernoulli vs. Binomial Response Variables).
However, the BICs differ by that constant AND a factor that depends on the "number of observations", and the number of observations differ in each:
> BIC(g_bern)
# [1] 715.6742
> BIC(g_binom)
# [1] 33.26891
> nobs(g_bern)
# [1] 500
> nobs(g_binom)
# [1] 5
Just to confirm, we can recalculate BIC for both:
> -2*logLik(g_bern) + attr(logLik(g_bern),"df")*log(nobs(g_bern))
# 'log Lik.' 715.6742 (df=5)
> -2*logLik(g_binom) + attr(logLik(g_binom),"df")*log(nobs(g_binom))
# 'log Lik.' 33.26891 (df=5)
and indeed the only place these two numbers differ is $N$.
UPDATE 2: When we try to assess the factor x
, we see a disagreement that is ONLY attributable to the number of observations:
> BIC(g_bern0) - BIC(g_bern)
# [1] -17.66498
> BIC(g_binom0) - BIC(g_binom)
# [1] 0.7556999
UPDATE 2: As expected, the AICs are consistent:
> AIC(g_bern0) - AIC(g_bern)
# [1] -0.8065485
> AIC(g_binom0) - AIC(g_binom)
# [1] -0.8065485
This surprises me, since I would think that R would "know" which of the two to use to prevent ambiguity. It has the same information in both cases.
Which one is "right"? Or is BIC really this arbitrary?
UPDATE: I am not trying to compare the Bernoulli to the Binomial model. This is just a toy example. I have a set of comparisons where it matters which setup I use, because the penalties for $N$ are different. I have two sets of model comparisons and the winning model changes based on the $N$ penalty, even though these appear to me to be the same sets of models.
UPDATES 2 and 3: Added the comparisons to the intercept-only model and changed random seed to get a sign difference in the BIC.