
The exercise is to predict the success rate for different groups of students. Here's sample data (using R):

library(data.table)
# One row per group: group size and number of successes
sampleDT <- data.table(Age = c(20, 21, 22), Gender = c("M", "F", "F"),
                       num_students = c(200, 100, 50), success = c(4, 2, 3))
# Observed success rate per group
sampleDT[, success_rate := success / num_students]
sampleDT
   Age Gender num_students success success_rate
1:  20      M          200       4         0.02
2:  21      F          100       2         0.02
3:  22      F           50       3         0.06

Two GLM approaches are considered here:

  1. Logistic regression with 'num_students' as weights: glm(success_rate ~ Age + Gender, data = sampleDT, family = binomial("logit"), weights = num_students)
  2. Poisson log-link regression: glm(success ~ Age + Gender, data = sampleDT, family = poisson, weights = num_students)
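
As a minimal sketch of what the two calls above do on the sample data: the block below uses a base data.frame (`sample_df` is an illustrative name, not from the question) so that it runs without extra packages. It also fits the two-column `cbind()` form, which is the standard equivalent way to write model 1. Note that with three rows and three coefficients both models are saturated, so each one reproduces its own response exactly and this tiny sample cannot separate them.

```r
# Same sample data as above, as a base data.frame (sample_df is an illustrative name)
sample_df <- data.frame(
  Age          = c(20, 21, 22),
  Gender       = c("M", "F", "F"),
  num_students = c(200, 100, 50),
  success      = c(4, 2, 3)
)
sample_df$success_rate <- sample_df$success / sample_df$num_students

# Model 1: logistic regression on proportions, weighted by group size
m_logit <- glm(success_rate ~ Age + Gender, data = sample_df,
               family = binomial("logit"), weights = num_students)

# Equivalent two-column form: successes and failures per group
m_logit2 <- glm(cbind(success, num_students - success) ~ Age + Gender,
                data = sample_df, family = binomial("logit"))

# Model 2 exactly as written in the question
m_pois <- glm(success ~ Age + Gender, data = sample_df,
              family = poisson, weights = num_students)

# Model 1 predicts a rate per group; model 2 predicts a count per group
rate_logit <- unname(fitted(m_logit))
count_pois <- unname(fitted(m_pois))
print(cbind(sample_df, rate_logit, count_pois))
```

The two models answer slightly different questions as written: model 1 returns a success probability for each group, while model 2 returns an expected number of successes.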

We could just run these two models and see which one gives the better result, but I would like to take a step back and understand the choice. So here are my questions:

  • How would you decide which approach to take?
  • What assumptions are we making under each approach?
kjetil b halvorsen
LeGeniusII
  • Your second model is invalid. Poisson regression is a model for counts. The logistic regression is valid, because it models [probabilities of success](https://stats.stackexchange.com/questions/232979/logistic-regression-use-of-real-values-between-0-and-1as-opposed-to-two-clas/233003#233003), but such a model specification is invalid for Poisson regression. – Tim Jan 25 '19 at 10:09
  • Did you mean probit instead of poisson? – ColorStatistics Jan 25 '19 at 18:51
  • @ColorStatistics I meant poisson. I thought "count/weight = probability of success" but I think I'm wrong. – LeGeniusII Jan 28 '19 at 23:05
  • @Tim, I'm not sure I would say that the Poisson model is "invalid." If the OP is trying to obtain the best predictions possible, but has no intention of performing inference, it's quite possible that the Poisson model performs better at predicting than the logistic model. In this case, the OP should use the Poisson model, even if the distributional assumptions are not met. Plus, the Poisson model the user has fit is fitting counts, not probabilities (model 1 modeled success_rate and model 2 modeled success). – StatsStudent Feb 10 '19 at 15:46
  • @LeGeniusII are you still interested in an answer to this question? If so, can you tell us if you are interested in making inferences with the coefficients or if you are just interested in making the best predictions possible once you've selected your model? – StatsStudent Feb 10 '19 at 15:49
  • @StatsStudent Poisson regression is not meaningful here, but by "invalid" I meant that such usage of weights would not work the same as with logistic regression, where it directly accounts for sample size. Moreover, if it is a matter of model flexibility, then Poisson assumes that mean = variance, so there are much more flexible choices if it is just a matter of making better predictions. – Tim Feb 10 '19 at 17:07
  • I'm not sure I understand, @Tim. If $y=1$ leads consistently to the smallest mean squared predicted error on validation data, wouldn't that be the natural choice for model, regardless, if the OP simply wants to predict and has no intention of inference on the estimated coefficients? As an aside, the OP *is* modelling counts. How do we know without examining output if the model isn't a good fit and that mean = variance? – StatsStudent Feb 10 '19 at 17:17
  • @StatsStudent it's too long for a comment, but there are more problems with Poisson regression here. – Tim Feb 10 '19 at 17:39
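
The point about weights raised in the comments can be illustrated with a hedged sketch. In a Poisson GLM, `weights` only rescales each row's contribution to the likelihood. The usual way to let group size enter the mean is an offset of `log(num_students)`, which models log(expected successes / num_students). This offset formulation is not proposed anywhere in the thread; it is shown here as the common textbook approach, and `sample_df`, `m_rate`, and `new_group` are illustrative names.

```r
# Sample data from the question, as a base data.frame (illustrative name)
sample_df <- data.frame(
  Age          = c(20, 21, 22),
  Gender       = c("M", "F", "F"),
  num_students = c(200, 100, 50),
  success      = c(4, 2, 3)
)

# Poisson rate model (an assumption, not the question's model 2):
# log(E[success]) = log(num_students) + b0 + b1 * Age + b2 * Gender
m_rate <- glm(success ~ Age + Gender + offset(log(num_students)),
              data = sample_df, family = poisson)

# Fitted counts divided by group size give fitted rates
rate_pois <- unname(fitted(m_rate)) / sample_df$num_students

# With the offset, the predicted count scales with the size of a new group
new_group <- data.frame(Age = 21, Gender = "F", num_students = c(100, 1000))
pred_counts <- unname(predict(m_rate, newdata = new_group, type = "response"))

print(rate_pois)
print(pred_counts)
```

For rates as small as these, the Poisson variance n·p is close to the binomial variance n·p·(1 − p), which is one reason the two approaches can give similar predictions when success is rare.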

0 Answers