
I learned in elementary statistics that, with a general linear model, observations must be independent for inferences to be valid. When clustering occurs, independence may no longer hold, leading to invalid inference unless the clustering is accounted for. One way to account for such clustering is to use mixed models. I would like to find an example dataset, simulated or not, which demonstrates this clearly. I tried using one of the sample datasets on the UCLA site for analysing clustered data

> require(foreign)
> require(lme4)
> dt <- read.dta("http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs.dta")

> m1 <- lm(api00~growth+emer+yr_rnd, data=dt)
> summary(m1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 740.3981    11.5522  64.092   <2e-16 ***
growth       -0.1027     0.2112  -0.486   0.6271    
emer         -5.4449     0.5395 -10.092   <2e-16 ***
yr_rnd      -51.0757    19.9136  -2.565   0.0108 * 


> m2 <- lmer(api00~growth+emer+yr_rnd+(1|dnum), data=dt)
> summary(m2)

Fixed effects:
             Estimate Std. Error t value
(Intercept) 748.21841   12.00168   62.34
growth       -0.09791    0.20285   -0.48
emer         -5.64135    0.56470   -9.99
yr_rnd      -39.62702   18.53256   -2.14

Unless I'm missing something, these results are similar enough that I wouldn't think the output from lm() is invalid. I have looked at some other examples (e.g. 5.2 from the Bristol University Centre for Multilevel Modelling) and found the standard errors are also not terribly different (I am not interested in the random effects themselves from the mixed model, but it is worth noting that the ICC from the mixed model output is 0.42).

So, my questions are 1) under what conditions will the standard errors be markedly different when clustering occurs, and 2) can someone provide an example of such a dataset (simulated or not).

gung - Reinstate Monica
Joe King
  • Can you expand on what you mean by clustering? – bayerj Sep 30 '14 at 19:33
  • @bayerj by clustering, I mean when observations that are similar to each other are grouped together within some kind of unit, for example 10 blood pressure measurements taken on each of 50 individuals. – Joe King Sep 30 '14 at 20:23

1 Answer


First of all, you are right that this dataset is perhaps not the best for understanding the mixed model. But let's first look at why:

require(foreign)
dt <- read.dta("http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs.dta")

length(dt$dnum)          # 310
length(unique(dt$dnum))  # 187 
sum(table(dt$dnum)==1)   # 132

You see that you have 310 observations in 187 groups, 132 of which contain only one observation. This does not mean that we should not use multi-level modelling, just that, as you found, we won't get very different results.

Multi-level modelling motivation

The motivation to use multi-level modelling starts from the design itself, not just from the results of the analysis. The most common example is taking multiple observations on each individual, but to make things more extreme and easier to grasp, think of asking individuals from different countries around the world about their income. The best examples are therefore those with a lot of heterogeneity between clusters; clusters that are homogeneous in the outcome under examination will, of course, not make much difference.

Example

So, let's simulate some data to make things clearer; simulation works better here because with real-life data the point is not as obvious. Imagine you take $100$ countries and ask $10$ individuals from each country about their income y, and also about something else x that has a positive effect on income with coefficient $0.5$.

set.seed(1)
I <- 100                  # number of countries (clusters)
J <- 10                   # individuals sampled per country
n <- I*J
i <- rep(1:I, each=J)     # country index for each observation
j <- rep(1:J, I)          # within-country index
x <- rnorm(n, mean=0, sd=1)
beta0  <- 1000            # overall intercept
beta1  <- 0.5             # true effect of x on y
sigma2 <- 1               # within-country (residual) variance
tau2   <- 200             # between-country variance
u <- rep(rnorm(I, mean=0, sd=sqrt(tau2)), each=J)  # country random effects
y <- beta0 + beta1*x + u + rnorm(n, mean=0, sd=sqrt(sigma2))
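As a quick aside (this calculation is mine, not part of the original answer): with tau2 = 200 and sigma2 = 1, the implied intraclass correlation is extreme, which is exactly what makes the contrast between the two models so visible:

```r
# Implied intraclass correlation of the simulated data:
# the share of total variance that lies between countries.
icc <- 200 / (200 + 1)   # tau2 / (tau2 + sigma2)
round(icc, 3)            # 0.995
```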

So, running a linear model you get

> summary(lm(y~x))

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 999.8255     0.4609 2169.230   <2e-16 ***
x             0.5728     0.4456    1.286    0.199    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.57 on 998 degrees of freedom
Multiple R-squared:  0.001653,  Adjusted R-squared:  0.0006528 
F-statistic: 1.653 on 1 and 998 DF,  p-value: 0.1989

and you conclude that x has no statistically significant effect on y. Notice how large the standard error is. But running a random-intercept model

> summary(lmer(y~x + (1|i)))

Random effects:
 Groups   Name        Variance Std.Dev.
 i        (Intercept) 213.062  14.597  
 Residual               1.066   1.032  
Number of obs: 1000, groups:  i, 100

Fixed effects:
            Estimate Std. Error t value
(Intercept) 999.8247     1.4600   684.8
x             0.4997     0.0327    15.3

you see how much the standard error of the estimate has changed. Looking at the random-effects part, we see how the variability has been decomposed: most of the variability in income is between countries, while within countries people have more similar incomes. In simple words, when the clustering is not accounted for, the effect of x "gets lost" (if we can use that term) in the between-country noise, but once the variability is decomposed you find the effect you should actually get.
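To connect this back to question 1, a standard rule of thumb (not derived in the answer above) is the design effect, $1 + (m-1)\rho$, where $m$ is the cluster size and $\rho$ the ICC: for a covariate that is constant within clusters, the naive lm() standard error is too small by roughly the square root of this factor. Here x varies within clusters, so lm() instead wastes the between-country variance and its SE is too large; for a country-level covariate the naive SE would be markedly too small.

```r
# Design effect for the simulated setting: m = 10 individuals per
# country and ICC = 200/201. For a cluster-constant covariate the
# naive SE would be off by roughly sqrt(deff).
m    <- 10
icc  <- 200 / (200 + 1)
deff <- 1 + (m - 1) * icc
deff        # ~9.96
sqrt(deff)  # ~3.16
```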

Steve
  • +1 Thank you, this is great. Though I'm sure I remember reading several times that SEs are typically smaller when failing to account for the clustering, so I'm still somewhat confused - what are the scenarios when the linear model would return a much too small SE ? – Joe King Sep 30 '14 at 20:20
  • @JoeKing this is true for clustered robust SE, not for multilevel modelling. You can see that too in the page in ats.ucla where you took the data. – Steve Sep 30 '14 at 20:43
  • @JoeKing to understand completely the difference look http://stats.stackexchange.com/questions/8291/clustered-standard-errors-and-multi-level-models – Steve Sep 30 '14 at 21:03