I have several binomial (logit) models that predict the probability of various binary outcomes (e.g., incurring any costs for health care treatment, having a chronic disease). My case and control groups are matched on average length of time enrolled in the study, but the time any given individual is enrolled is highly variable, ranging from 30 days to 9 years. I want to take this into account in the models and have read several explanations of whether, when, and how it is (or isn't) appropriate to include an offset in a binomial model (e.g., Using offset in binomial model to account for increased numbers of patients), but I am still unsure what the best approach is.
Taking the example of one of my binary outcomes, the regression model glm(binary_outcome ~ enrolled_days, data = df, family = "binomial") gives an unexponentiated coefficient of 0.0004 for enrolled_days and an exponentiated coefficient of 1.0004, which to me indicates that I should use it as an offset if I believe the probability of the outcome increases proportionally to the number of days enrolled in the study. I think this is a fairly reasonable assumption in my case.
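A minimal sketch of the fit described above, run on simulated data rather than the real df. The data frame sim, its sample size, and the generating coefficients are made up for illustration; only the model formula and the 30-day to 9-year range come from the description.

```r
# Simulated stand-in for df; sample size and coefficients are assumptions.
set.seed(1)
n <- 5000
sim <- data.frame(enrolled_days = round(runif(n, min = 30, max = 9 * 365)))

# Assumed truth: log-odds of the outcome rise by 0.0004 per enrolled day.
sim$binary_outcome <- rbinom(n, size = 1,
                             prob = plogis(-1 + 0.0004 * sim$enrolled_days))

fit <- glm(binary_outcome ~ enrolled_days, data = sim, family = binomial)

b <- coef(fit)["enrolled_days"]
b              # change in log-odds per additional day enrolled
exp(b)         # odds ratio per additional day
exp(365 * b)   # odds ratio per additional year, same model, different unit
```

The last line is only a change of units: a per-day odds ratio close to 1 can still correspond to a sizeable odds ratio per year, because the coefficient is multiplied by the number of days before exponentiating.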
Am I correct in coming to this conclusion? If so, would I incorporate the variable enrolled_days as glm(binary_outcome ~ predictors + offset(enrolled_days), data = df, family = "binomial")? When I do this with my data, I get a warning that the algorithm did not converge and that fitted probabilities numerically 0 or 1 occurred. I don't understand why this would happen, since I have a large sample (~55,000 people), and the mean, minimum, and maximum of enrolled_days are the same for both cases and controls.
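A sketch that may help frame the warning, on simulated data only. An offset enters the linear predictor with its coefficient fixed at 1, so offset(enrolled_days) adds the raw number of days directly to the log-odds. The second half shows one alternative that is often suggested for varying exposure time, a complementary log-log link with offset(log(enrolled_days)), which corresponds to assuming a constant event rate per day. The data frame sim2, the predictor x, and the rate parameters are hypothetical, and whether that assumption suits the real data is not something this sketch can settle.

```r
# Range of enrolled_days taken from the description: 30 days to 9 years.
days <- c(30, 365, 9 * 365)

# Raw days added to the log-odds: every value maps to a probability of about 1.
plogis(days)

# The same values on the log scale are far more moderate.
log(days)

# Simulated data under an assumed constant daily event rate.
set.seed(2)
n <- 5000
sim2 <- data.frame(enrolled_days = round(runif(n, 30, 9 * 365)),
                   x = rnorm(n))            # x is a hypothetical predictor
rate <- exp(-8 + 0.5 * sim2$x)              # assumed events per day
sim2$binary_outcome <- rbinom(n, 1, 1 - exp(-rate * sim2$enrolled_days))

# cloglog link with a log-time offset: cloglog(p) = log(rate) + log(days).
fit_cll <- glm(binary_outcome ~ x + offset(log(enrolled_days)),
               data = sim2, family = binomial(link = "cloglog"))
coef(fit_cll)   # estimates on the log-rate scale
```

Under these simulated conditions the coefficients are interpreted on the log-rate scale rather than the log-odds scale, so exponentiating them gives rate ratios, not odds ratios.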