
I am trying to model a response variable which is a proportion (so a response between 0 and 1, see picture for distribution).

Ideally I would like to model it without using the actual counts, so as a decimal.

So far I have been using a binomial family in R (the first approach proposed in the question "How to fit a mixed model with response variable between 0 and 1?").

# Response is a proportion in [0, 1]; the binomial family uses the logit link by default
model <- glm(Response ~ X1 + X2 + X3,
             data = Training_data,
             family = 'binomial')

I think the model is doing okay, but its predictions are poor when the observed ratio is 1 (as you can see from the picture).

Is my approach of using a binomial distribution wrong?
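The setup described above can be reproduced on simulated data. This is only a sketch: the names `Training_data`, `Response`, `X1`, `X2` and `X3` come from the question, but the simulated values and coefficients are hypothetical stand-ins for the real data. It fits the same binomial GLM to a decimal response and compares how many observed values equal 1 with how many predictions do.

```r
# Simulated stand-in for Training_data; all values here are hypothetical.
set.seed(1)
n <- 500
Training_data <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
eta <- 1.5 + 1.2 * Training_data$X1 - 0.8 * Training_data$X2 +
  0.5 * Training_data$X3
# Proportions out of 10 trials, stored as decimals, so exact 0s and 1s occur.
Training_data$Response <- rbinom(n, size = 10, prob = plogis(eta)) / 10

# A fractional response triggers a "non-integer #successes" warning under
# family = binomial; it is suppressed here only to keep the output clean.
model <- suppressWarnings(
  glm(Response ~ X1 + X2 + X3, data = Training_data, family = binomial)
)

# type = "response" returns fitted means on the 0-1 scale.
pred <- predict(model, newdata = Training_data, type = "response")

observed_ones  <- sum(Training_data$Response == 1)  # rows observed at exactly 1
predicted_ones <- sum(pred == 1)                    # rows predicted at exactly 1
range(pred)                                         # stays inside (0, 1)
```

In this simulation some responses are exactly 1, while every prediction is a conditional mean that stays strictly inside the unit interval.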

  • Think of the model as a way to get an appropriate average proportion, that average being conditional on the covariates. Even in principle, the average will always be less than the maximum whenever the conditional distributions aren't constant. Further, a logit link can never give you a prediction of exactly 1 or 0, only probabilities very near each limit. Put otherwise, even for plain regression, the distribution of predicted responses is never going to mimic the distribution of observed responses, unless you have a silly or vacuous model for which there is perfect prediction. – Nick Cox Apr 14 '20 at 08:02
  • The binomial family is right in principle, except that your standard errors are dubious. There should be a way in R to get better standard errors, but I don't know what it is. – Nick Cox Apr 14 '20 at 08:03
  • I am assuming that logit link is the default with your choices. – Nick Cox Apr 14 '20 at 08:04
  • Okay thanks, yes, the logit link is the default. I suspected that the binomial was unable to give a prediction of exactly 0 or 1. I'll have a look into how to get better standard errors. Thank you. – Jared Fowler Apr 14 '20 at 22:13
  • It's not impossible for a straight-line fit to predict extreme values of the response perfectly, but if you think about examples, that is a rare occurrence. The other way round, you could fit a ramp function with some effort, but rarely does a ramp function seem preferable to a sigmoid. – Nick Cox Apr 14 '20 at 22:18
  • Maybe beta regression? See https://stats.stackexchange.com/questions/117922/beta-regression – kjetil b halvorsen Apr 15 '20 at 17:52
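
On the standard-error point raised in the comments, one base-R option (a suggestion added here, not something proposed in the thread) is `family = quasibinomial`. It keeps the same mean model and logit link but estimates a dispersion parameter and rescales the standard errors by it. The sketch below uses simulated data, so the data frame `sim` and all coefficients are hypothetical.

```r
# Hypothetical data: proportions out of 10 trials, stored as decimals.
set.seed(2)
n <- 500
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
eta <- 0.5 + 1.0 * sim$X1 - 0.7 * sim$X2 + 0.3 * sim$X3
sim$Response <- rbinom(n, size = 10, prob = plogis(eta)) / 10

# Binomial fit on decimals: warns about non-integer successes (suppressed here).
fit_binom <- suppressWarnings(
  glm(Response ~ X1 + X2 + X3, data = sim, family = binomial)
)
# Quasibinomial fit: same coefficients, dispersion estimated from the data.
fit_quasi <- glm(Response ~ X1 + X2 + X3, data = sim, family = quasibinomial)

se_binom   <- summary(fit_binom)$coefficients[, "Std. Error"]
se_quasi   <- summary(fit_quasi)$coefficients[, "Std. Error"]
dispersion <- summary(fit_quasi)$dispersion

# Side-by-side view: identical estimates, rescaled standard errors.
cbind(estimate = coef(fit_quasi), se_binom, se_quasi)
```

The point estimates agree between the two fits; the quasibinomial standard errors are the binomial ones multiplied by the square root of the estimated dispersion. Whether that is the most appropriate correction for a given dataset is a separate judgment call.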

0 Answers