ICC as expected correlation between two randomly drawn units that are in the same group

Question

In multilevel modelling the intraclass correlation often gets calculated from a random-effects ANOVA

$$ y_{ij} = \gamma_{00} + u_j + e_{ij} $$

where $u_j$ are the level-2 residuals and $e_{ij}$ are the level-1 residuals. We then obtain estimates $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ of the variances of $u_j$ and $e_{ij}$, respectively, and plug them into the following equation:

$$ \rho = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 +\hat{\sigma}_e^2} $$

Hox (2002) writes on p15 that

The intraclass correlation ρ can also be interpreted as the expected correlation between two randomly drawn units that are in the same group

There's a question here that asks an advanced question (why it is exactly equal to this instead of approximately equal) and gets an advanced answer.

However, I wish to ask a much simpler question.

Question: What does it even mean to talk about a correlation between two randomly drawn units that are in the same group?

I have a basic understanding of the fact that the intraclass correlation works on groups and not on paired data. However, I still don't understand how the correlation could be calculated if all we had was two randomly drawn units from the same group. If I look at the dot plots on the Wikipedia page for ICC, for example, we have multiple groups and multiple points within each group.

Wolfgang · Accepted Answer · 2016-05-20T22:13:44.057

It may be easiest to see the equivalence if you consider a case where there are only two individuals per group. So, let's go through a specific example (I'll use R for this):

dat <- read.table(header=TRUE, text = "
group person   y
1     1        5
1     2        6
2     1        3
2     2        2
3     1        7
3     2        9
4     1        2
4     2        2
5     1        3
5     2        5
6     1        6
6     2        9
7     1        4
7     2        2
8     1        8
8     2        7")

So, we have 8 groups with 2 individuals each. Now let's fit the random-effects ANOVA model:

library(nlme)
res <- lme(y ~ 1, random = ~ 1 | group, data=dat, method="ML")

And finally, let's compute the ICC:

getVarCov(res)[1] / (getVarCov(res)[1] + res$sigma^2)

This yields: 0.7500003 (it's 0.75 to be exact, but there is some slight numerical imprecision in the estimation procedure here).

Now let's reshape the data from the long format into the wide format:

dat <- as.matrix(reshape(dat, direction="wide", v.names="y", idvar="group", timevar="person"))

It looks like this now:

   group y.1 y.2
1      1   5   6
3      2   3   2
5      3   7   9
7      4   2   2
9      5   3   5
11     6   6   9
13     7   4   2
15     8   8   7

And now compute the correlation between y.1 and y.2:

cor(dat[,2], dat[,3])

This yields: 0.8161138

Wait, what? What's going on here? Shouldn't it be 0.75? Not quite! What I have computed above is not the ICC (intraclass correlation coefficient), but the regular Pearson product-moment correlation coefficient, which is an interclass correlation coefficient. Note that in the long-format data, it is entirely arbitrary who is person 1 and who is person 2 -- the pairs are unordered. You could reshuffle the data within groups and you would get the same results. But in the wide-format data, it is not arbitrary who is listed under y.1 and who is listed under y.2. If you were to switch around some of the individuals, you would get a different correlation (except if you were to switch around all of them -- then this is equivalent to cor(dat[,3], dat[,2]) which of course still gives you 0.8161138).

What Fisher pointed out is a little trick to get the ICC with the wide-format data. Have every pair be included twice, in both orders, and then compute the correlation:

dat <- rbind(dat, dat[,c(1,3,2)])
cor(dat[,2], dat[,3])

This yields: 0.75.

So, as you can see, the ICC is really a correlation coefficient -- for the "unpaired" data of two individuals from the same group.
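The point about arbitrary ordering can be checked directly. The following is a minimal sketch, assuming the same 8 pairs as above typed in as two vectors; the names `swap` and `double_entry` are introduced here only for illustration. It swaps the two individuals in groups 1 and 3 and compares the Pearson correlation with the double-entry correlation before and after.

```r
# Wide-format version of the example data (one element per group)
y1 <- c(5, 3, 7, 2, 3, 6, 4, 8)
y2 <- c(6, 2, 9, 2, 5, 9, 2, 7)

# Pearson correlation with the original (arbitrary) ordering
r_original <- cor(y1, y2)

# Swap the two individuals in groups 1 and 3 only
swap <- c(1, 3)
y1_swapped <- replace(y1, swap, y2[swap])
y2_swapped <- replace(y2, swap, y1[swap])
r_swapped <- cor(y1_swapped, y2_swapped)

# Double-entry correlation: every pair enters once in each order
double_entry <- function(a, b) cor(c(a, b), c(b, a))
icc_original <- double_entry(y1, y2)
icc_swapped  <- double_entry(y1_swapped, y2_swapped)

# The Pearson correlation changes with the labelling ...
print(c(r_original = r_original, r_swapped = r_swapped))
# ... while the double-entry correlation does not
print(c(icc_original = icc_original, icc_swapped = icc_swapped))
```

Because the double-entry data contain each pair in both orders, relabelling who is "person 1" only permutes rows, which is why that correlation stays the same.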

If there were more than two individuals per group, you can still think of the ICC in that way, except that there would be more ways of creating pairs of individuals within groups. The ICC is then the correlation between all possible pairings (again in an unordered way).
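The model-based version of the same statement can be sketched in a few lines. This assumes (as is standard for this model, though not spelled out in the question) that $u_j$ and $e_{ij}$ have mean zero, variances $\sigma_u^2$ and $\sigma_e^2$, and are all mutually independent. For two different units $i \neq i'$ in the same group $j$:

$$ \mathrm{Cov}(y_{ij}, y_{i'j}) = \mathrm{Cov}(u_j + e_{ij},\; u_j + e_{i'j}) = \mathrm{Var}(u_j) = \sigma_u^2 $$

$$ \mathrm{Var}(y_{ij}) = \mathrm{Var}(y_{i'j}) = \sigma_u^2 + \sigma_e^2 $$

$$ \mathrm{Corr}(y_{ij}, y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \rho $$

The only thing the two units share is the group effect $u_j$, so it alone drives their covariance.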

Jake Westfall · Answer 2 · 2016-05-22T18:47:04.683

@Wolfgang already gave a great answer. I want to expand on it a little to show that you can also arrive at the estimated ICC of 0.75 in his example dataset by literally implementing the intuitive algorithm of randomly selecting many pairs of $y$ values -- where the members of each pair come from the same group -- and then simply computing their correlation. And then this same procedure can easily be applied to datasets with groups of any size, as I'll also show.

First we load @Wolfgang's dataset (not shown here). Now let's define a simple R function that takes a data.frame and returns a single randomly selected pair of observations from the same group:

get_random_pair <- function(df){
  # select a random row
  i <- sample(nrow(df), 1)
  # select a random other row from the same group
  # (the call to rep() here is admittedly odd, but it's to avoid unwanted
  # behavior when the first argument to sample() has length 1)
  j <- sample(rep(setdiff(which(df$group==df[i,"group"]), i), 2), 1)
  # return the pair of y-values
  c(df[i,"y"], df[j,"y"])
}

Here's an example of what we get if we call this function 10 times on @Wolfgang's dataset:

test <- replicate(10, get_random_pair(dat))
t(test)
#       [,1] [,2]
#  [1,]    9    6
#  [2,]    2    2
#  [3,]    2    4
#  [4,]    3    5
#  [5,]    3    2
#  [6,]    2    4
#  [7,]    7    9
#  [8,]    5    3
#  [9,]    5    3
# [10,]    3    2

Now to estimate the ICC, we just call this function a large number of times and then compute the correlation between the two columns.

random_pairs <- replicate(100000, get_random_pair(dat))
cor(t(random_pairs))
#           [,1]      [,2]
# [1,] 1.0000000 0.7493072
# [2,] 0.7493072 1.0000000
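As an aside on the rep() workaround inside get_random_pair: when the first argument to sample() is a single number n >= 1, R samples from 1:n instead of returning that number. The sketch below illustrates this and shows one possible alternative; `pick_one` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: pick one element of x, even when length(x) == 1
pick_one <- function(x) x[sample.int(length(x), 1)]

set.seed(1)
candidates <- 7  # a single candidate row index, as happens with groups of size 2

# sample() treats a length-one numeric as "sample from 1:7"
draws_sample <- replicate(1000, sample(candidates, 1))
# indexing via sample.int() always returns the element itself
draws_pick <- replicate(1000, pick_one(candidates))

print(sort(unique(draws_sample)))  # several different values from 1:7
print(unique(draws_pick))          # 7
```

Duplicating the vector with rep(..., 2), as the function above does, achieves the same thing by making sure sample() never sees a length-one argument.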

This same procedure can be applied, with no modifications at all, to datasets with groups of any size. For example, let's create a dataset consisting of 100 groups of 100 observations each, with the true ICC set to 0.75 as in @Wolfgang's example.

set.seed(12345)
group_effects <- scale(rnorm(100))*sqrt(4.5)
errors <- scale(rnorm(100*100))*sqrt(1.5)
dat <- data.frame(group = rep(1:100, each=100),
                  person = rep(1:100, times=100),
                  y = rep(group_effects, each=100) + errors)

stripchart(y ~ group, data=dat, pch=20, col=rgb(0,0,0,.1), ylab="group")
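A brief check of why this construction fixes the ICC at 0.75: scale() standardizes a vector to sample mean 0 and sample variance 1, so multiplying by sqrt(4.5) or sqrt(1.5) gives sample variances of exactly 4.5 and 1.5. The sketch below assumes its own seed and the illustrative names `v_u`, `v_e`, and `icc_true`.

```r
set.seed(1)

# scale() centers to mean 0 and rescales to sample variance 1,
# so the products below have sample variances 4.5 and 1.5
group_effects <- scale(rnorm(100)) * sqrt(4.5)
errors <- scale(rnorm(100 * 100)) * sqrt(1.5)

v_u <- var(as.vector(group_effects))
v_e <- var(as.vector(errors))
print(c(v_u = v_u, v_e = v_e))

# ICC implied by these two variance components
icc_true <- 4.5 / (4.5 + 1.5)
print(icc_true)  # 0.75
```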

Estimating the ICC based on the variance components from a mixed model, we get:

library("lme4")
mod <- lmer(y ~ 1 + (1|group), data=dat, REML=FALSE)
summary(mod)
# Random effects:
#  Groups   Name        Variance Std.Dev.
#  group    (Intercept) 4.502    2.122   
#  Residual             1.497    1.223   
# Number of obs: 10000, groups:  group, 100

4.502/(4.502 + 1.497)
# 0.7504584

And if we apply the random pairing procedure, we get

random_pairs <- replicate(100000, get_random_pair(dat))
cor(t(random_pairs))
#           [,1]      [,2]
# [1,] 1.0000000 0.7503004
# [2,] 0.7503004 1.0000000

which closely agrees with the variance component estimate.

Note that while the random pairing procedure is intuitive and didactically useful, the method illustrated by @Wolfgang is actually a lot smarter. For a dataset like this one of size 100*100, the number of unique within-group pairings (not including self-pairings) is 100 * choose(100, 2) = 495,000, or 990,000 when each pair is counted in both orders -- a big but not astronomical number -- so it is entirely feasible to compute the correlation over the full set of all possible pairings rather than sampling randomly from the dataset. Here's a function to retrieve all possible pairings for the general case with groups of any size:

get_all_pairs <- function(df){
  # do this for every group and combine the results into a matrix
  do.call(rbind, by(df, df$group, function(group_df){
    # get all possible pairs of indices
    i <- expand.grid(seq(nrow(group_df)), seq(nrow(group_df)))
    # remove self-pairings
    i <- i[i[,1] != i[,2],]
    # return a 2-column matrix of the corresponding y-values
    cbind(group_df[i[,1], "y"], group_df[i[,2], "y"])
  }))
}

Now if we apply this function to the 100*100 dataset and compute the correlation, we get:

cor(get_all_pairs(dat))
#           [,1]      [,2]
# [1,] 1.0000000 0.7504817
# [2,] 0.7504817 1.0000000

This agrees well with the other two estimates. Compared to the random pairing procedure, it is much faster to compute and should also be a more efficient estimate, in the sense of having less variance.
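For larger datasets, the all-pairs correlation can also be obtained without enumerating the pairs, because the sums involved reduce to per-group sums. The following is a sketch under that observation, not part of either answer above; `pairwise_icc` is a hypothetical name, and the function assumes `y` is a numeric vector and `group` a grouping vector of the same length.

```r
# Correlation over all ordered within-group pairs (self-pairings excluded),
# computed from group sums instead of building the pairs explicitly.
pairwise_icc <- function(y, group) {
  n_g <- ave(y, group, FUN = length)  # size of each observation's group
  w <- n_g - 1                        # times each value appears in a column of pairs
  m <- sum(w * y) / sum(w)            # mean of either column of pairs
  d <- y - m
  s_g <- tapply(d, group, sum)        # per-group sum of deviations
  q_g <- tapply(d^2, group, sum)      # per-group sum of squared deviations
  cross <- sum(s_g^2 - q_g)           # sum of d_i * d_j over ordered pairs with i != j
  cross / sum(w * d^2)
}

# The 8 groups of 2 from the first answer
y <- c(5, 6, 3, 2, 7, 9, 2, 2, 3, 5, 6, 9, 4, 2, 8, 7)
group <- rep(1:8, each = 2)
print(pairwise_icc(y, group))  # 0.75
```

The identity used is that, within a group, the sum of d_i * d_j over all i != j equals (sum of d)^2 minus the sum of d^2, so the cost grows with the number of observations instead of the number of pairs.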
