Is using a Binomial GLM correct for comparing two groups probability of doing something?

Question

I'm currently a biology student and we use R occasionally for statistics. I'm currently looking at two groups chances of mating based on their size. I am using my own generated data for this and only the large group mates and the small group does not. My lecturer suggests using a Binomial GLM but using their code doesn't work and, to be honest, I am not sure what it is testing for. I need a test that compares chances of mating and if the difference is statistically different I think. This is the code that I am using. It works fine until the last line.

#this is 35 males who were chosed, with a mean claw size of 15
not.chosen.size<-rnorm(n=65,mean=5,sd=1.5)
#this is 65 males who were not chosen, with a mean claw size of 5

population.size<-c(chosen.size,not.chosen.size)
population.size


#add a column indicating whether they were chosen or not
yes<-rep(1,35)
no<-rep(0,65)
#combine them
chosen<-c(yes,no)

dat<-data.frame(population.size,chosen)
dat

mod<-glm(chosen~population.size,family=binomial,data=dat)

I would appreciate any help you can offer.

EDIT:

I've gotten these results by implementing the changes suggested by @EdM.

    Min       1Q   Median       3Q      Max  
-1.4430  -0.8965  -0.4957   0.9204   2.4367  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -4.4769     0.9991  -4.481 7.44e-06 ***
population.size   0.7105     0.1675   4.243 2.21e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 133.75  on 99  degrees of freedom
Residual deviance: 107.94  on 98  degrees of freedom
AIC: 111.94

Number of Fisher Scoring iterations: 4

EDIT: New Code:

chosen.size<-rnorm(n=35,mean=7,sd=1.5)
#this is 35 males who were chosed, with a mean claw size of 7
not.chosen.size<-rnorm(n=65,mean=5,sd=1.5)
#this is 65 males who were not chosen, with a mean claw size of 5

population.size<-c(chosen.size,not.chosen.size)
population.size

#add a column indicating whether they were chosen or not
yes<-rep(1,39)
no<-rep(0,61)
#combine them
chosen<-c(yes,no)

dat<-data.frame(population.size,chosen)
dat

mod<-glm(chosen~population.size,family=binomial,data=dat)

With the wide difference in mean claw size between the groups, I suspect that you are getting [perfect separation](https://stats.stackexchange.com/q/11109/28500) in your model. That is, the claw size is perfectly predicting mating success in your simulated data, so a standard logistic model can't be fit. Try claw sizes that are closer together with substantial overlap between the mating-success groups. — EdM, Jan 22 '22 at 18:07
@EdM, I adjusted the code to reflect what you said. These are the results, could you assist me in interpreting them ? `Coefficients: (Intercept) population.size -4.4769 0.7105 Degrees of Freedom: 99 Total (i.e. Null); 98 Residual Null Deviance: 133.7 Residual Deviance: 107.9 AIC: 111.9` — MiniDeece, Jan 22 '22 at 18:26
It's really hard to read those results in a comment. Please edit your original question to show your updated data/model and results. That will make it much easier for me and for others who might view your question. — EdM, Jan 22 '22 at 18:35
I'm starting on an answer, but in the meantime please edit the question to show how you changed your data simulation from the original. Also, when you simulate data it's best to use `set.seed()` with an argument that you show just before that, e.g `set.seed(1234)`, to allow others to reproduce your work exactly. — EdM, Jan 22 '22 at 19:03

score 0 · Answer 1 · answered Jan 22 '22 at 19:33

This thread goes into extensive detail about how to interpret the output from a logistic regression model.

Briefly, in logistic regression (generalized linear model, binomial family, logit link) intercepts and regression coefficients are in log-odds scales.

With default coding in R, the (Intercept) is the baseline log-odds, estimated when all continuous predictors are at values of 0 and categorical predictors are at their reference values. A regression coefficient is then the change in log-odds from the (Intercept) associated with a predictor.

In your case, the (Intercept) is the estimated log-odds of mating success when population.size = 0. That might not be a realistic value but that doesn't matter. You just use that to start the calculations. The population.size coefficient is the change in log-odds per unit change in population.size. For example, if population.size = 6 the log-odds of success are -4.4769 + (6 * 0.7105) = -0.2139.

Log-odds can be hard to think about but they are very useful for these models. To put the estimate into a somewhat easier form to grasp, first exponentiate the log-odds to get the odds of success: exp(-0.2139) = 0.807, the estimated odds of success at population.size = 6.

If $p$ is the probability of success, the the odds are $p/(1-p)$. You can then solve for $p$ as odds/(1+odds); you get an estimated probability of success of 0.447 at population.size = 6.

Statistical software can make these predictions for any set of predictor values, along with confidence intervals based on the estimated (co)variances among the regression coefficients. For more complicated models, particularly when there are interactions, it's wise not to focus on the apparent "significance" of the individual coefficients. Look instead at the overall quality of the model and predictions for particular sets of predictor values of interest.

Thanks, that's quite helpful. Can you offer any suggestion on how to plot this ? @EdM — MiniDeece, Jan 22 '22 at 20:35
@MiniDeece e.g., plot the predicted probabilities as a function of `population.size`. The help page for `predict.glm` has an example. It's best to include confidence limits, too. That's made easier (once you learn the syntax) by the [`rms` package](https://cran.r-project.org/package=rms), whose `Predict()` function does that for its related `Glm()` models (note capital "G") on data frames that have been summarized by a "datadist" option specifying ranges to display. E.g., `dd — EdM, Jan 22 '22 at 21:02

Is using a Binomial GLM correct for comparing two groups probability of doing something?

1 Answers1