I am working with a birds dataset to determine the success or failure of these introduced species and the effect of the explanatory variables on that outcome. A sample of the final dataset I work with is below:
dput(birds[1:20,])
structure(list(status = c(1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L), length = c(1520L,
1250L, 870L, 720L, 820L, 770L, 50L, 570L, 580L, 480L, 470L, 450L,
435L, 275L, 256L, 230L, 330L, 330L, 300L, 180L), mass = c(9600,
5000, 3360, 2517, 3170, 4390, 1930, 1020, 910, 590, 539, 940,
684, 230, 162, 170, 501, 439, 386, 95), range = c(1.21, 0.56,
0.07, 1.1, 3.45, 2.96, 0.01, 9.01, 7.9, 4.33, 1.04, 2.17, 4.81,
0.31, 0.24, 0.77, 2.23, 0.22, 2.4, 0.69), migr = c(1L, 1L, 1L,
3L, 3L, 2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
2L), insect = c(12L, 0L, 0L, 12L, 0L, 0L, 0L, 6L, 6L, 0L, 12L,
12L, 12L, 3L, 3L, 3L, 3L, 3L, 3L, 12L), diet = c(2L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
2L)), .Names = c("status", "length", "mass", "range", "migr",
"insect", "diet"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 22L
), class = "data.frame")
My question is: how do I decide which link function to use? I have fitted a binomial model with each of the following link functions:
# binomial GLMs with a binary response, one per link function
frec.logit <- glm(status ~ ., data=birds, family=binomial())
frec.probit <- glm(status ~ ., data=birds, family=binomial("probit"))
frec.cauchit <- glm(status ~ ., data=birds, family=binomial("cauchit"))
frec.cll <- glm(status ~ ., data=birds, family=binomial("cloglog"))
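# Bayesian logistic regression (MCMClogit comes from the MCMCpack package;
# it has no family argument, so none is passed)
library(MCMCpack)
bay.mod1 <- MCMClogit(status ~ ., data = birds)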
# comparison
anova(frec.logit, frec.probit, frec.cauchit, frec.cll, bay.mod1)
Here I run into a couple of problems.
When fitting the cloglog binomial model, I get a warning:
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
However, I still get output. Later, in the anova comparison, there are only 4 models; I assume the cloglog model has not been included. As I understand it, the complementary log-log model is used for survival times; is it therefore inappropriate here?
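To make the warning above concrete, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical and is not the birds data). It uses the default logit link, but the message is raised by `glm.fit` for any binomial link when some fitted probabilities reach the numerical boundary, which typically happens when a predictor separates the two outcomes almost perfectly. It also shows that a fitted model object is still returned, since this is a warning and not an error.

```r
# Hypothetical, perfectly separated data: y is 0 for x <= 10 and 1 for x > 10.
toy <- data.frame(x = 1:20, y = rep(c(0L, 1L), each = 10))

# Collect the warnings instead of printing them, so they can be inspected.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial()),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                  # the warnings raised while fitting
print(inherits(fit, "glm"))  # a glm object is returned despite the warnings

# Fitted probabilities, to see which observations sit at the boundary.
p <- fitted(fit)
print(round(p, 4))
```

Running the same kind of check on `fitted(frec.cll)` would show which observations in the real data are driving the warning.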
> anova(frec.logit, frec.probit, frec.cauchit, frec.cll, bay.mod1)
Analysis of Deviance Table
Model 1: status ~ length + mass + range + migr + insect + diet
Model 2: status ~ length + mass + range + migr + insect + diet
Model 3: status ~ length + mass + range + migr + insect + diet
Model 4: status ~ length + mass + range + migr + insect + diet
  Resid. Df Resid. Dev Df Deviance
1        65     79.973
2        65     79.811  0  0.16261
3        65     80.582  0 -0.77075
4        65     80.802  0 -0.22064
Can I draw conclusions about the best link function from this table, such as: the value in the Deviance column is lowest for model 3, corresponding to the cauchit model, therefore this model will be used for the upcoming variable selection, model adequacy checks and prediction?
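For reference, the four link functions give models with the same residual degrees of freedom, so they are not nested and the Df column of the anova table is 0. A minimal sketch of a side-by-side comparison, assuming simulated data (the `sim` data frame and its single predictor `x` are hypothetical, not the birds data), puts the residual deviance and AIC of each fit in one table instead:

```r
set.seed(1)

# Hypothetical simulated data: one predictor, binary response.
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + x))
sim <- data.frame(y = y, x = x)

# Fit the same formula once per link function.
links <- c("logit", "probit", "cauchit", "cloglog")
fits <- lapply(links, function(l) {
  glm(y ~ x, data = sim, family = binomial(link = l))
})
names(fits) <- links

# One row per link: residual df, residual deviance and AIC.
comparison <- data.frame(
  link     = links,
  resid.df = sapply(fits, df.residual),
  deviance = sapply(fits, deviance),
  AIC      = sapply(fits, AIC),
  row.names = NULL
)

# Sort so the smallest residual deviance comes first.
comparison <- comparison[order(comparison$deviance), ]
print(comparison)
```

With the same number of parameters in every model, ranking by AIC and ranking by residual deviance give the same order; whether differences as small as those in the table above are meaningful is a separate question.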
I have also noticed that some of the parameters estimated with the frequentist approach differ from those obtained with the Bayesian approach (they do not lie within the Bayesian CI); even the sign is different. I thought both approaches should yield the same results?