Logistic regression vs. Bayesian approach

Question

I am working with a birds dataset to determine the success or failure of introduced species and the effect of the explanatory variables on it. A sample of the final dataset I work with is below:

dput(birds[1:20,])
structure(list(status = c(1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L), length = c(1520L, 
1250L, 870L, 720L, 820L, 770L, 50L, 570L, 580L, 480L, 470L, 450L, 
435L, 275L, 256L, 230L, 330L, 330L, 300L, 180L), mass = c(9600, 
5000, 3360, 2517, 3170, 4390, 1930, 1020, 910, 590, 539, 940, 
684, 230, 162, 170, 501, 439, 386, 95), range = c(1.21, 0.56, 
0.07, 1.1, 3.45, 2.96, 0.01, 9.01, 7.9, 4.33, 1.04, 2.17, 4.81, 
0.31, 0.24, 0.77, 2.23, 0.22, 2.4, 0.69), migr = c(1L, 1L, 1L, 
3L, 3L, 2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L), insect = c(12L, 0L, 0L, 12L, 0L, 0L, 0L, 6L, 6L, 0L, 12L, 
12L, 12L, 3L, 3L, 3L, 3L, 3L, 3L, 12L), diet = c(2L, 1L, 1L, 
2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 
2L)), .Names = c("status", "length", "mass", "range", "migr", 
"insect", "diet"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 22L
), class = "data.frame")

My question is: how do I decide which link function to use? I have fitted binomial models with each of the available link functions:

  # binomial GLMs with a binary response, one per link function
  frec.logit <- glm(status ~ ., data=birds, family=binomial())
  frec.probit <- glm(status ~ ., data=birds, family=binomial("probit"))
  frec.cauchit <- glm(status ~ ., data=birds, family=binomial("cauchit"))
  frec.cll <- glm(status ~ ., data=birds, family=binomial("cloglog"))

  # bayesian (MCMClogit comes from the MCMCpack package)
  library(MCMCpack)
  bay.mod1 <- MCMClogit(status ~ . , family = binomial, data = birds)

  # comparison
  anova(frec.logit, frec.probit, frec.cauchit, frec.cll, bay.mod1)

Here I run into a couple of problems.

When fitting the complementary log-log (cloglog) binomial model, I get a warning:

Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred

However, I still get output. Later, in the anova comparison, only 4 models are listed, so I assume the cloglog model has not been included. As I understand it, the complementary log-log link is used for survival times; is it therefore not appropriate here?

> anova(frec.logit, frec.probit, frec.cauchit, frec.cll, bay.mod1)
Analysis of Deviance Table

Model 1: status ~ length + mass + range + migr + insect + diet
Model 2: status ~ length + mass + range + migr + insect + diet
Model 3: status ~ length + mass + range + migr + insect + diet
Model 4: status ~ length + mass + range + migr + insect + diet
  Resid. Df Resid. Dev Df Deviance
1        65     79.973            
2        65     79.811  0  0.16261
3        65     80.582  0 -0.77075
4        65     80.802  0 -0.22064
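
One way to check which objects end up in a table like the one above is to look at their classes: as far as I can tell from the `anova.glm` source, any argument that does not inherit from class "glm" is dropped without a message. The sketch below uses simulated data and a made-up stand-in object of class "mcmc" (the class that `MCMClogit` returns) in place of the real fits; `sim`, `fit.logit`, `fit.cll` and `fake.mcmc` are all hypothetical names.

```r
# Simulated stand-in data (hypothetical; not the birds dataset)
set.seed(42)
sim <- data.frame(x = rnorm(100))
sim$status <- rbinom(100, 1, plogis(0.5 * sim$x))

fit.logit <- glm(status ~ x, data = sim, family = binomial("logit"))
fit.cll   <- glm(status ~ x, data = sim, family = binomial("cloglog"))

# Stand-in for an MCMClogit() result: a matrix of draws with class "mcmc"
fake.mcmc <- structure(matrix(rnorm(20), ncol = 2), class = "mcmc")

# Which of the objects are glm fits?
is.glm <- sapply(list(logit = fit.logit, cloglog = fit.cll, mcmc = fake.mcmc),
                 inherits, what = "glm")
is.glm

# anova() only tabulates the glm fits, so the number of rows
# can be compared with the number of TRUE values above
tab <- anova(fit.logit, fit.cll, fake.mcmc)
nrow(tab)
```

If the same holds for the real objects, the four rows in the table would be the four `glm` fits (including cloglog), and the model left out would be `bay.mod1`.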

Can I draw conclusions about the best link function based on this output, such as: the deviance is lowest for model 3, corresponding to the cauchit model, therefore this will be used for the upcoming variable selection, model adequacy checks, and prediction?
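
For reference, the four fits share the same predictors and differ only in the link, so they are not nested: the Df column is 0 and the Deviance column only reports the difference between consecutive residual deviances. In the table above the Resid. Dev column itself is smallest for model 2 (79.811). A minimal sketch of putting residual deviance and AIC side by side for each link, assuming simulated data (`sim`, `fits` and `comparison` are hypothetical names, not the real birds fits):

```r
# Simulated stand-in for the birds data (hypothetical; not the real dataset)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$status <- rbinom(n, 1, plogis(-0.3 + 0.8 * sim$x1 - 0.5 * sim$x2))

# Fit the same binomial model once per link function
links <- c("logit", "probit", "cauchit", "cloglog")
fits <- lapply(links, function(l) {
  glm(status ~ ., data = sim, family = binomial(link = l))
})
names(fits) <- links

# Collect residual deviance and AIC in one table, best AIC first
comparison <- data.frame(
  link      = links,
  resid.dev = sapply(fits, deviance),
  AIC       = sapply(fits, AIC),
  row.names = NULL
)
comparison[order(comparison$AIC), ]
```

Because every fit has the same number of coefficients and the response is 0/1, AIC here is just the residual deviance plus a constant, so both columns rank the links identically. Differences of a fraction of a deviance unit, as in the table above, are small, so this kind of ranking is probably better read as a rough guide than as a formal test.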

I have also noticed that some of the estimated parameters from the frequentist approach differ from those obtained with the Bayesian approach (they do not lie within the Bayesian credible interval); even the sign is different. I thought both approaches should yield the same results?

See https://stats.stackexchange.com/questions/20523/difference-between-logit-and-probit-models/30909#30909 — Tim, Apr 20 '17 at 11:10
