The exercise is to predict the success rate for different groups of students. Here's sample data (using R):
library(data.table)
# One row per group: group size and number of successes
sampleDT <- data.table(Age = c(20, 21, 22), Gender = c("M", "F", "F"),
                       num_students = c(200, 100, 50), success = c(4, 2, 3))
# Observed success rate per group
sampleDT[, success_rate := success / num_students]
sampleDT
   Age Gender num_students success success_rate
1:  20      M          200       4         0.02
2:  21      F          100       2         0.02
3:  22      F           50       3         0.06
Two GLM approaches are considered here:
- Logistic regression with 'num_students' as weights:
glm(success_rate ~ Age + Gender, data = sampleDT, family = binomial("logit"), weights = num_students)
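- Poisson regression with a log link and log(num_students) as an offset (so that the model describes the rate rather than the raw count):
glm(success ~ Age + Gender + offset(log(num_students)), data = sampleDT, family = poisson)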
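For concreteness, here is a minimal sketch that fits both models to the sample data and puts the fitted rates side by side. The Poisson fit uses log(num_students) as an offset, which is the usual way to model a rate from counts; that is my assumption about what is intended. The names fit_logit, fit_logit2, fit_pois, rate_logit and rate_pois are made up for illustration. With only three rows and three coefficients both models are saturated, so on this toy data they simply reproduce the observed rates; a real comparison needs more groups than parameters.

```r
library(data.table)

# Sample data from the question
sampleDT <- data.table(Age = c(20, 21, 22), Gender = c("M", "F", "F"),
                       num_students = c(200, 100, 50), success = c(4, 2, 3))
sampleDT[, success_rate := success / num_students]

# Binomial GLM: proportion as response, group size as prior weights
fit_logit <- glm(success_rate ~ Age + Gender, data = sampleDT,
                 family = binomial("logit"), weights = num_students)

# Same binomial model written with a (successes, failures) response
fit_logit2 <- glm(cbind(success, num_students - success) ~ Age + Gender,
                  data = sampleDT, family = binomial("logit"))

# Poisson GLM for the rate: count as response, log(group size) as offset
# (assumption: the offset form is what the Poisson approach is meant to be)
fit_pois <- glm(success ~ Age + Gender + offset(log(num_students)),
                data = sampleDT, family = poisson("log"))

# Fitted success rate per group under each model
sampleDT[, rate_logit := as.numeric(fitted(fit_logit))]
sampleDT[, rate_pois := as.numeric(fitted(fit_pois)) / num_students]
print(sampleDT)
```

The two binomial calls are two ways of writing the same model, so their coefficients should agree. The Poisson fit returns expected counts, hence the division by num_students to get back to a rate.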
We could just run both models and see which one gives the better result, but I would like to take a step back and understand the difference between them. So here are my questions:
- How would you decide which approach to take?
- What assumptions are we making under each approach?