Which family in glm() in R is most appropriate for a response variable of proportions?

Question

I currently have a data set where my dependent variable is a proportion (e.g., the percentage of successes). My y-values are scaled to lie between 0 and 1 (e.g., 54.6% is 0.546). I decided to use logistic regression because my y-variable is between 0 and 1. However, since my y-variable is not binary, I assume I cannot use the family "binomial" in glm() in R.

It is also worth noting that this is for a class project, so the dispersion value is important, as is the AIC. When I use the binomial family, R automatically fixes the dispersion at 1, which does not help me figure out whether my data are overdispersed.

Questions:

1. Is logistic regression appropriate to use in this scenario?
2. If the binomial family (link = logit) is appropriate for logistic regression here, how should I scale or change the weights so that I get an accurate dispersion value, as opposed to the assumed "1"?

I am open to other methods for going about this. Thanks!

UPDATE: I was able to do some more research, and one of my independent variables, "total population", is the denominator of the proportion (I had to calculate it myself to make sure it aligns). My y-variable is still the proportion of the population for which the event happened (a success).

Where do the proportion values come from? Are any exactly 0 or 1? What are their distributions between 0 and 1? Error estimates in binomial logistic regression depend on the numbers of cases in each group, not just the proportion. Depending on the nature of your data, other types of analysis might be possible. Please provide that information by editing your question, as comments are easily overlooked and can be deleted. — EdM, Jul 30 '21 at 15:40
If you really don't have the counts/denominators, look into [Fractional outcome regression](https://m-clark.github.io/posts/2019-08-20-fractional-regression/). The tutorial I linked to shows examples with `R` and contrasts them with outputs from Stata. In short: `glm` with `family = binomial` is probably fine, but the use of robust standard errors is encouraged. — COOLSerdash, Jul 30 '21 at 15:45
See also https://stats.stackexchange.com/questions/530149/help-with-needed-with-fractional-outcomes-logit-regression — kjetil b halvorsen, Jul 30 '21 at 23:39

score 2 · Answer 1 · edited Jul 30 '21 at 23:43

As far as I know, you can use the binomial family if you make use of the weights argument. However, this only makes sense if your dependent variable comes from a binomial distribution and you supply the number of trials behind each proportion as the weights. A similar discussion can be found here: "How to apply binomial GLMM (glmer) to percentages rather than yes-no counts?".
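A minimal sketch of the weights approach, assuming the denominators are known. The data are simulated and every variable name (`total`, `successes`, `prop`, `x`) is hypothetical. It also shows that supplying the counts as a two-column response gives the same fit, so the response does not have to be converted from a proportion to a count.

```r
# Simulated, hypothetical data: 'total' is the group size and
# 'prop' is the observed proportion of successes in each group.
set.seed(1)
n_groups  <- 50
x         <- rnorm(n_groups)
total     <- sample(20:200, n_groups, replace = TRUE)
successes <- rbinom(n_groups, size = total, prob = plogis(-0.5 + 0.8 * x))
prop      <- successes / total

# Form 1: proportion as the response, number of trials as prior weights.
fit_weighted <- glm(prop ~ x, family = binomial(link = "logit"),
                    weights = total)

# Form 2: two-column matrix of successes and failures.
fit_counts <- glm(cbind(successes, total - successes) ~ x,
                  family = binomial(link = "logit"))

# The two specifications describe the same model.
print(all.equal(coef(fit_weighted), coef(fit_counts)))
```

The `total` variable here plays the role of the "total population" column mentioned in the question's update; it enters as a weight, not as a predictor.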

Alternatively, you could use a fractional logit model, where the dependent variable is a fraction. A reference on that can be found here: "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates". In glm(), this can be implemented by changing the family to quasibinomial.
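As a rough illustration of the quasibinomial route, the sketch below fits a fractional logit to simulated fractions with no denominators. The beta-distributed response and the names `y_frac` and `x` are assumptions made only for this example.

```r
# Simulated, hypothetical fractions in (0, 1) with no known denominators.
set.seed(2)
n_obs  <- 100
x      <- rnorm(n_obs)
mu     <- plogis(0.3 + 0.6 * x)
y_frac <- rbeta(n_obs, shape1 = mu * 10, shape2 = (1 - mu) * 10)

# Fractional logit: same mean model as logistic regression,
# but the dispersion is estimated instead of being fixed at 1.
fit_frac <- glm(y_frac ~ x, family = quasibinomial(link = "logit"))

print(coef(fit_frac))
print(summary(fit_frac)$dispersion)  # estimated dispersion parameter

# Quasi families have no full likelihood, so the AIC is reported as NA.
print(fit_frac$aic)
```

Because the AIC is not available for quasi families, this approach trades the AIC for an estimated dispersion, which may matter if both are required.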

And regarding your 2nd question: I think the usual approach is to look at the residual deviance. As far as I know, the ratio of the residual deviance (the sum of the squared deviance residuals) to the residual degrees of freedom is a decent approximation of the dispersion parameter; values well above 1 suggest overdispersion. However, I cannot provide a reference for this right now.
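A sketch of that dispersion check on simulated, deliberately overdispersed data (all names are hypothetical). It computes the deviance-based ratio and, for comparison, the Pearson-based ratio, which is the estimate that `summary()` reports for a quasibinomial fit.

```r
# Simulated, hypothetical grouped data with extra between-group noise
# on the logit scale, which induces overdispersion.
set.seed(3)
n_groups  <- 60
x         <- rnorm(n_groups)
total     <- sample(30:300, n_groups, replace = TRUE)
p         <- plogis(-0.2 + 0.5 * x + rnorm(n_groups, sd = 0.5))
successes <- rbinom(n_groups, size = total, prob = p)
prop      <- successes / total

fit <- glm(prop ~ x, family = binomial, weights = total)

# Deviance-based estimate: residual deviance / residual degrees of freedom.
disp_deviance <- deviance(fit) / df.residual(fit)

# Pearson-based estimate: sum of squared Pearson residuals / residual df.
disp_pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# The quasibinomial fit reports the Pearson-based estimate directly.
fit_q <- glm(prop ~ x, family = quasibinomial, weights = total)

print(c(deviance = disp_deviance,
        pearson  = disp_pearson,
        quasi    = summary(fit_q)$dispersion))
```

Both ratios should be close to 1 when the binomial variance assumption holds; here they come out well above 1 because of the added noise.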

Thanks! That discussion helped a lot. So basically, I can use the binomial family. However, would I need to change my y-variable to the number of people in the population who are a success, as opposed to keeping it a percentage? I'm a little confused about that. When I use the total population as the weight while the y-variable is a percentage, I get a HUGE AIC value. — Sarah, Jul 30 '21 at 19:05
