I learned in elementary statistics that, for inferences from a general linear model to be valid, the observations must be independent. When data are clustered, independence may no longer hold, and inferences are invalid unless the clustering is accounted for. One way to account for it is to use a mixed model. I would like to find an example dataset, simulated or not, that demonstrates this clearly. I tried one of the sample datasets on the UCLA site for analysing clustered data:
> require(foreign)
> require(lme4)
> dt <- read.dta("http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs.dta")
> m1 <- lm(api00~growth+emer+yr_rnd, data=dt)
> summary(m1)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  740.3981    11.5522  64.092   <2e-16 ***
growth        -0.1027     0.2112  -0.486   0.6271
emer          -5.4449     0.5395 -10.092   <2e-16 ***
yr_rnd       -51.0757    19.9136  -2.565   0.0108 *
> m2 <- lmer(api00~growth+emer+yr_rnd+(1|dnum), data=dt)
> summary(m2)
Fixed effects:
             Estimate Std. Error t value
(Intercept) 748.21841   12.00168   62.34
growth       -0.09791    0.20285   -0.48
emer         -5.64135    0.56470   -9.99
yr_rnd      -39.62702   18.53256   -2.14
Unless I'm missing something, these results are similar enough that I wouldn't conclude the output from lm() is invalid. I have looked at some other examples (e.g. example 5.2 from the Bristol University Centre for Multilevel Modelling) and found that the standard errors there are also not terribly different. (I am not interested in the random effects themselves from the mixed model, but it is worth noting that the ICC from the mixed-model output is 0.42.)
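(For reference, the ICC above comes from the variance components of the lmer fit; a minimal sketch of that calculation, assuming the m2 object fitted above:)

```r
# Sketch: the ICC is the between-district intercept variance as a share of
# the total variance, taken from the variance components of m2 above
vc  <- as.data.frame(VarCorr(m2))  # rows: dnum intercept variance, residual variance
icc <- vc$vcov[1] / sum(vc$vcov)   # ~0.42 for this dataset
```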
So, my questions are: 1) under what conditions will the standard errors be markedly different when clustering occurs, and 2) can someone provide an example of such a dataset (simulated or not)?
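To make the question concrete, here is the kind of simulation I have in mind (a sketch in base R; all names and parameter values are my own choices, not from any real dataset, and the true effect of x is set to zero so the naive type I error rate can be checked):

```r
# Sketch: simulate clustered data where the predictor varies only BETWEEN
# clusters (like yr_rnd, a school-level variable) and the ICC is high.
# Under these conditions the naive lm() standard error should be far too small.
set.seed(1)
n_clusters <- 50    # e.g. districts
n_per      <- 20    # observations per cluster
icc        <- 0.6   # high intraclass correlation
sims       <- 1000
est <- se <- numeric(sims)
for (s in 1:sims) {
  x <- rep(rbinom(n_clusters, 1, 0.5), each = n_per)          # cluster-level predictor
  u <- rep(rnorm(n_clusters, sd = sqrt(icc)), each = n_per)   # shared cluster effect
  e <- rnorm(n_clusters * n_per, sd = sqrt(1 - icc))          # individual-level noise
  y <- 0 * x + u + e                                          # true effect of x is 0
  fit    <- summary(lm(y ~ x))
  est[s] <- fit$coefficients["x", "Estimate"]
  se[s]  <- fit$coefficients["x", "Std. Error"]
}
sd(est)                      # empirical standard error of the estimate
mean(se)                     # average naive lm() SE -- much smaller
mean(abs(est / se) > 1.96)   # type I error rate, far above the nominal 0.05
```

With a cluster-level predictor the design effect is roughly 1 + (n_per - 1) * icc, so the naive SE here should be understated several-fold; is that the sort of condition my question 1) is driving at?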