GLM modeling binomial proportions with varying trials and probabilities

Question

A collection of coin manufacturers, $m$, each produces a line of coins, the number of which varies by manufacturer (some produce 3 types of coins, others make 7, and so on). Each manufacturer imparts a characteristic bias towards flipping heads on all their coins. Further, manufacturers produce many of the same types of coins, $c$, which have a fixed, unknown bias. We can estimate the impact of coin, $c$, through an external database of coin-specific biases. Lastly, the coins are pressed once each year, $a$, and each pressing of the coins has another unique bias shared across all manufacturers and coins. Each coin is flipped a variable number of times, $n_i$, and the number of heads could be represented as

\begin{align*} y_{i,m,c,a} \sim \textrm{Binom}(n_{i}, p_{m,c,a}) \end{align*}

Assuming a large data set of coin-flip trials across an identifiable set of conditions (i.e., year, manufacturer, coin type), I would like to estimate the $m$ biases. Our measured variables are:

number of heads, $y$
number of trials, $n_i$
manufacturer, $m$
coin-specific probabilities (taken from external database), $c$
year, $a$

I would like to estimate $m$ biases and perform a model comparison between nested models that have/don't have manufacturer info in a NHST framework. "Hits" will be manufacturers with large biases. My first thought was to use a generalized linear mixed model (GLMM), yet I'm having trouble figuring out how to specify the model, particularly given the varying number of trials. (For clarification, with simpler designs, I'm familiar with R syntax for using weights when the number of trials varies, function(y/n ~ ., ..., weights = n, ...).)

While I can see a Bayesian formulation for this model,

Please assume a large data set of 100,000 observations and many of these models to fit when proposing an inference strategy
For "historical reasons" (read: this test should look like an extension of an existing testing procedure), I'd like to use maximum likelihood.

I'm an applied scientist, so answers that provide references and code would be most helpful. Would Coull & Agresti 2000 be relevant to my case?

Is a mixed model necessary? I can't tell this is way to go. I would start with a glm. Maybe something like `glm(cbind(heads, flips) ~ m + year + offset(logit(c_bias)), family = binomial)` in R. I *think* the coin specific biases could be an offset in the model if they can comfortably treated as known from the external database. Welcome to CrossValidated! — JTH, May 26 '21 at 21:44
Thanks @JTH--and thanks for the comments. Agreed that your model is a reasonable starting point, and I'm curious to hear others' thoughts on fixed v random here, though I suspect I can make that call myself from the real data. — a crab, May 26 '21 at 22:03
A gap in my knowledge that, in part, motivated posting here is (1) how to specify this model as a multinomial/-variate response of the form `(y_1, y_2, y_3, ..., y_c) ~ m + year + (c_bias_1, c_bias_2, c_bias_3, ..., c_bias_c)` and (2) whether inference here would be similar to the model you proposed, akin to reformulating a multinomial as a poisson ('poisson trick') — a crab, May 26 '21 at 22:03

JTH · Answer 1 · 2021-05-27T18:21:34.967

A somewhat elegant, but perhaps naive approach, would be to take the coin bias data as trustworthy and known, and include those biases as offsets in a binomial generalized linear model. This automatically handles any imbalances in the number of trials. This also assumes that the effects for year and manufacturer are fixed effects, and that's really in the eye of the modeler. You can code this analysis in R, so I will demonstrate by creating fake data with the additive biases you indicated.

## Create factors for predictors
m <- factor(c("m1", "m2", "m3", "m4", "m5", "m6"))
y <- factor(c("y1", "y2", "y3", "y4", "y5"))
coin <- factor(c("c1", "c2", "c3", "c4"))

## Assume from bias from 1/2 due to coin
c.bias <- c(0, -0.1, 0.1, 0.2)

## Convert to logit scale for logistic glm
logit.c.bias <- qlogis(0.5 + c.bias)

## Create model matrix given predictor data
dat <- expand.grid(m = m, y = y, coin = coin)
X <- model.matrix( ~ m + y + coin, dat)

## Create hypothetical coefficient vector
beta <- c(0, # 1 intercept
          -3, -2, -1, 0, 1, # 5 m coefs, for example
          3, 2, 1, 0, # 4 y coefs, for example
          logit.c.bias[-1]) # 3 coin.biases

lin.pred <- X %*% beta
dat$logit.c.bias <- rep(logit.c.bias, each = 30)

## Sample 500 coin tosses from each coin * manufacturer * year
## For simplicity, I use constant number of trials, but this could be a vector of trials in your problem.
tosses <- 500
heads <- rbinom(nrow(X), tosses, plogis(lin.pred))
dat$heads <- heads

# fit model. Use offset() to include bias. Note, different formula from my comment, I correct typo. f1 correctly recovers parameters.
f1 <- glm(cbind(heads, tosses - heads) ~  m + y + offset(logit.c.bias),
          binomial, dat)

# fit without m
f2 <- update(f1, ~ . - m)

# Test that there is variation in coin bias among manufacturers
anova(f1, f2, test = "LRT")

This analysis could be extended depending on the real problem:

Random effects could be used to model the manufacturer biases

The formula for the random effects version might be

f3 <- glmer(cbind(heads, tosses - heads) ~  (1 | m) + y + offset(logit.c.bias),
          family = binomial, data = dat)

if you use lme4.

A prior distribution for coin biases could be determined from the coin bias database, which is updated based on the experimental data. This would be more realistic than using offsets for coin biases.

Thanks for the detailed response and providing a simulated data set! I didn't originally think of the coin bias data as an offset and therefore didn't appreciate that using it as an offset would handle varying numbers of trials, so this was very helpful. I'm still interested in trying other specifications of this model, as well as playing around with your specification on my data. I'll wait to accept and update here! — a crab, May 27 '21 at 03:49
Also-- Any thoughts on treating this as a multiple response model ([described in my comment on the original post](https://stats.stackexchange.com/questions/525936/glm-modeling-binomial-proportions-with-varying-trials-and-probabilities#comment968267_525936)? — a crab, May 27 '21 at 03:52
Doesn't *seem* like a multi response problem to me. AFAIK, you've got one outcome, # Heads, which is affected by a few covariates. — JTH, May 27 '21 at 18:21

GLM modeling binomial proportions with varying trials and probabilities

1 Answers1