
Suppose I have 10 students, who each attempt to solve 20 math problems. The problems are scored correct or incorrect (in longdata) and each student's performance can be summarized by an accuracy measure (in subjdata). Models 1, 2, and 4 below appear to produce different results, but I understand them to be doing the same thing. Why are they producing different results? (I included model 3 for reference.)

library(lme4)
library(car)  # for logit()

set.seed(1)
nsubjs = 10
nprobs = 20
subjdata = data.frame(subj = 1:nsubjs, iq = rep(seq(80, 120, 10), nsubjs/5))
longdata = subjdata[rep(seq_len(nrow(subjdata)), each = nprobs), ]
longdata$correct = runif(nsubjs*nprobs) < pnorm(longdata$iq/50 - 1.4)
subjdata$acc = as.vector(tapply(longdata$correct, longdata$subj, mean))
model1 = lm(logit(acc) ~ iq, subjdata)
model2 = glm(acc ~ iq, subjdata, family = gaussian(link = 'logit'))
model3 = glm(acc ~ iq, subjdata, family = binomial(link = 'logit'),
             weights = rep(nprobs, nsubjs))  # proportions need trial counts
model4 = glmer(correct ~ iq + (1|subj), longdata,
               family = binomial(link = 'logit'))
user20061
  • I also tried beta regression, but got an error... `library(betareg)` `model5 = betareg(acc~scale(iq),subjdata)` – user20061 Jan 25 '13 at 06:42
  • `library(car)` is necessary, for the logit function. – user20061 Jan 25 '13 at 17:53
  • It may help you to read two of my answers to related questions: [Difference between logit and probit models](http://stats.stackexchange.com/questions/20523//30909#30909) (which discusses link functions & GLiMs in general--a comment at the end specifically addresses your 1 & 3), & [Difference between generalized linear models & generalized linear mixed models](http://stats.stackexchange.com/questions/32419//32421#32421) (which discusses how your 4 is different from 1 & 3). – gung - Reinstate Monica Jan 26 '13 at 04:07

2 Answers


Models 1 and 2 are different because the first transforms the response & the 2nd transforms its expected value.

For Model 1, the logit of each response is normally distributed $$\newcommand{\logit}{\operatorname{logit}}\logit Y_i\sim\mathrm{N}\left(\mu_i,\sigma^2\right)$$ with its mean being a linear function of the predictor & coefficient vectors, $$\mu_i=x_i'\beta$$ & therefore $$ Y_i=\logit^{-1}\left(x_i'\beta+\varepsilon_i\right)$$ For Model 2, the response itself is normally distributed $$Y_i\sim\mathrm{N}\left(\mu_i,\sigma^2\right)$$ with the logit of its mean being a linear function of the predictor & coefficient vectors, $$\logit\mu_i=x_i'\beta$$ & therefore $$ Y_i=\logit^{-1}\left(x_i'\beta\right)+\varepsilon_i$$

So the variance structure will be different. Imagine simulating from Model 2: the variance will be independent of the expected value; & though the expected values of the responses will be between 0 & 1, the responses will not all be.
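
A quick way to see this is to simulate from each model's assumptions (a sketch, not code from the thread; the names `eta`, `sigma`, `y1`, `y2` are mine):

```r
set.seed(2)
eta = seq(-2, 2, length.out = 200)  # linear predictor x'beta
sigma = 0.5

# Model 1: logit(Y) ~ N(x'beta, sigma^2), so Y = logit^{-1}(x'beta + eps)
y1 = plogis(eta + rnorm(200, 0, sigma))

# Model 2: Y ~ N(logit^{-1}(x'beta), sigma^2), so Y = logit^{-1}(x'beta) + eps
y2 = plogis(eta) + rnorm(200, 0, sigma)

range(y1)  # always strictly inside (0, 1)
range(y2)  # can fall outside [0, 1]
```

Under Model 1 the noise is added *inside* the inverse-logit, so simulated responses can never leave (0, 1); under Model 2 it is added *outside*, so they can.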

Generalized linear mixed models like your Model 4 are different again because they contain random effects: see here & here.
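
To sketch that difference in the thread's notation (my notation for the random effect, not from the original answer): in Model 4 each response is Bernoulli, with a subject-specific random intercept entering the linear predictor, $$\operatorname{logit}\Pr\left(Y_{ij}=1\right)=x_{ij}'\beta+u_j,\qquad u_j\sim\mathrm{N}\left(0,\sigma_u^2\right)$$ so its fixed-effect coefficients are conditional on subject, while Models 1–3 describe marginal (population-averaged) relationships.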

Scortchi - Reinstate Monica
  • Thank you very much -- this very clearly distinguishes model 1 and model 2. Your reasoning that model 2 predicts some accuracy scores (although not their expected values) to be outside [0,1] is especially helpful (and disqualifies it for my purposes). I believe a similar intuition can be used against model 1: its range of possible predicted accuracy scores falls in (0,1), not [0,1]. With a limited number of questions, a model should predict some accuracy scores to be 0 or 1, and a binomial distribution can do just that. – user20061 Jan 26 '13 at 07:20
  • Note you should usually fit the binomial GLM with logit link against raw data (your `longdata`), not the proportions as in your Model 3. – Scortchi - Reinstate Monica Jan 26 '13 at 16:14
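
Expanding on that last comment (a sketch of mine, not from the thread; `fit_raw` and `fit_prop` are my names): the two parameterizations of the binomial GLM give the same coefficient estimates once the proportions carry their trial counts via `weights`:

```r
# Recreate the question's data
set.seed(1)
nsubjs = 10; nprobs = 20
subjdata = data.frame(subj = 1:nsubjs, iq = rep(seq(80, 120, 10), nsubjs/5))
longdata = subjdata[rep(seq_len(nrow(subjdata)), each = nprobs), ]
longdata$correct = runif(nsubjs*nprobs) < pnorm(longdata$iq/50 - 1.4)
subjdata$acc = as.vector(tapply(longdata$correct, longdata$subj, mean))

# Same binomial GLM two ways: raw 0/1 responses vs. proportions + trial counts
fit_raw  = glm(correct ~ iq, longdata, family = binomial)
fit_prop = glm(acc ~ iq, subjdata, family = binomial,
               weights = rep(nprobs, nsubjs))
all.equal(coef(fit_raw), coef(fit_prop))  # coefficients agree to numerical tolerance
```

Fitting against the raw data additionally avoids the "non-integer #successes" warning that `glm` issues when given bare proportions.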

+1 to @Scortchi, who has provided a very clear and concise answer. I want to make a couple of complementary points.

First, for your second model, you are specifying that your response distribution is Gaussian (a.k.a., normal). This must be false, because each answer is scored as correct or incorrect. That is, each answer is a Bernoulli trial; thus, your response distribution is a binomial. This idea is accurately reflected in your code as well.

Next, the probability that governs the response distribution is normally distributed, so the link ought to be probit, not logit.

Lastly, if this were a real situation, you would need to account for random effects for both subjects and questions, as they are extremely unlikely to be identical. The way you generated these data, the only relevant aspect of each person is their IQ, which you have accounted for explicitly. Thus, there is nothing left over that needs to be accounted for by a random effect in the model. This is also true for the questions, because random variations in question difficulty are not part of the data-generating process in your code.
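
To illustrate the probit point (a sketch of mine, not from the answer): the data were generated with `pnorm(iq/50 - 1.4)`, i.e. a probit model with intercept -1.4 and slope 1/50 = 0.02, so a probit-link GLM on the raw data should roughly recover those values:

```r
# Recreate the question's longdata
set.seed(1)
nsubjs = 10; nprobs = 20
subjdata = data.frame(subj = 1:nsubjs, iq = rep(seq(80, 120, 10), nsubjs/5))
longdata = subjdata[rep(seq_len(nrow(subjdata)), each = nprobs), ]
longdata$correct = runif(nsubjs*nprobs) < pnorm(longdata$iq/50 - 1.4)

# Probit link matches the generating process pnorm(iq/50 - 1.4)
fit_probit = glm(correct ~ iq, longdata, family = binomial(link = 'probit'))
coef(fit_probit)  # should be near c(-1.4, 0.02), up to sampling noise
```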

I don't mean to be nitpicking here. I recognize that your setup is simply designed to facilitate your question, and it has served that purpose; @Scortchi was able to address your questions very directly, with minimal fuss. However, I point these things out because they offer additional opportunities to understand the situation you are grappling with, and because you may not have realized that your code matches some parts of your storyline but not others.

gung - Reinstate Monica
  • Thank you for such careful thoughts about my code. As someone who works with empirical data, I am proud to say I have no expertise in generating fake data, and it shows here in the shortcomings you have identified. Although, my novice level of understanding may also be revealing itself. – user20061 Jan 26 '13 at 07:32
  • Thanks gung, that extra info was useful and helps others (at least me) understand the whole situation a bit better. Getting a handle on the GLM approach is tough. – Christopher Poile Jun 07 '13 at 17:16