Essentially, I have some covariate data X and a dependent variable Y consisting of the proportions of each sample that showed a certain response (i.e. each value lies between 0 and 1). I suspect I want to proceed via a GLM approach, but the thing is, I don't know the sizes of each of those samples!
My thoughts are to proceed by a quasibinomial methodology, estimating the dispersion parameter. Assuming the sizes of each sample are not too different, I can thus keep the logit link between the linear predictor and the proportion, but disregard the contribution of n in the usual binomial variance of np(1-p). Then I can do hypothesis testing the usual way?
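To spell out the reasoning as I understand it, here is a short sketch under the assumption that every sample has the same (unknown) size n. The symbols S_i (the unobserved count of responders), x_i (the covariate vector), beta, and phi (the dispersion) are my own notation, not anything given by the data:

```latex
% Assumed model: S_i is the unobserved count, Y_i = S_i / n the observed proportion
S_i \sim \mathrm{Binomial}(n, p_i), \qquad Y_i = \frac{S_i}{n}, \qquad
\operatorname{logit}(p_i) = x_i^{\top}\beta

% Mean and variance of the observed proportion
\mathbb{E}[Y_i] = p_i, \qquad
\operatorname{Var}(Y_i) = \frac{\operatorname{Var}(S_i)}{n^2} = \frac{p_i(1-p_i)}{n}

% Quasibinomial variance function with free dispersion \phi
\operatorname{Var}(Y_i) = \phi\, p_i(1-p_i)
\quad\Longrightarrow\quad \phi = \frac{1}{n}
```

If this is right, the estimated dispersion is standing in for 1/n. With unequal sizes n_i the variance would be p_i(1-p_i)/n_i, so a single phi could only approximate the individual 1/n_i, which is why I am assuming the sizes are not too different.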
Does this make any sense?
Some R code:
# simulate some data
X = rnorm(500)
Z = rnorm(500)                  # pure-noise covariate, unrelated to Y
p = plogis(X*0.1 + 2)           # inverse logit of the linear predictor
n = 50                          # common sample size (unknown in the real problem)
Y = rbinom(length(X), n, p)/n   # observed proportions
Y2 = cbind(Y*n, n-Y*n)          # successes and failures, usable only when n is known
# fit each model once; the binomial model is the 'true' one here
fit_quasi = glm(Y~X+Z, family = "quasibinomial")
fit_binom = glm(Y2~X+Z, family = "binomial")
summary(fit_quasi)
summary(fit_binom)
anova(fit_quasi, test = "Chisq")
anova(fit_binom, test = "Chisq")
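A more direct check of the idea, as a rough sketch rather than a definitive test: if the common-n assumption holds, the estimated quasibinomial dispersion should land near 1/n, the coefficients of the two fits should agree, and the standard errors should differ only through the estimated dispersion. The seed and the names `counts`, `fit_q`, `fit_b`, and `phi_hat` are arbitrary choices of mine for illustration.

```r
# Sketch: compare the quasibinomial fit (n unknown) with the binomial fit (n known).
set.seed(42)                        # arbitrary seed, only for reproducibility
N = 500                             # number of observed proportions
n = 50                              # common sample size, treated as unknown below
X = rnorm(N)
Z = rnorm(N)                        # noise covariate
p = plogis(0.1 * X + 2)             # true response probabilities
counts = rbinom(N, n, p)            # counts we would not observe in practice
Y = counts / n                      # the proportions we do observe

fit_q = glm(Y ~ X + Z, family = quasibinomial)
fit_b = glm(cbind(counts, n - counts) ~ X + Z, family = binomial)

# Pearson-based dispersion estimate from the quasibinomial fit
phi_hat = summary(fit_q)$dispersion
phi_hat                             # should be in the neighbourhood of 1/n
1 / n

# Coefficients and standard errors side by side
cbind(quasi = coef(fit_q), binom = coef(fit_b))
cbind(quasi = coef(summary(fit_q))[, "Std. Error"],
      binom = coef(summary(fit_b))[, "Std. Error"])
```

With a common n the two sets of estimating equations differ only by the constant factor n, so I would expect the coefficients to match up to convergence tolerance, and the ratio of the standard errors to be about sqrt(phi_hat * n).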
Seems to work, but am I missing something? Surely someone's done something like this before?