If you're looking to get a "family effect" and an "item effect," we can think of there being random intercepts for both of these, and then model this with the 'lme4' package.
But first, we have to give each sibling an id that is unique across the whole dataset, rather than unique only within family.
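A minimal sketch of one way to do that, assuming your data frame is called family and has columns family and sibling (with sibling numbered within family -- the column names here are assumptions about your data):
# build a globally unique sibling id from the family/sibling pair
family$sib_id<-interaction(family$family, family$sibling, drop=TRUE)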
Then for "the correlation between measurements taken on siblings within the same family for different items," we can specify something like:
mod<-lmer(value ~ (1|family)+(1|item), data=family)
This will give us one fixed-effect intercept for all siblings, plus two random intercepts (each with an estimated variance), one for family and one for item.
Then, for "the correlation between measurements taken on siblings within the same family for the same item," we can do the same thing but just subset our data, so we have something like:
mod2<-lmer(value ~ (1|family), data=subset(family,item=="1"))
I think this might be an easier approach to your question. But, if you just want the ICC for item or family, the 'psych' package has an ICC() function -- just be cautious about how item and value are melted in your example data.
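For instance, a hedged sketch of that route, assuming the long data frame family has columns sib_id, item, and value (as above), and that you want consistency across items:
# cast long data to a siblings-by-items layout, then compute the ICCs
library(reshape2)
library(psych)
wide<-dcast(family, sib_id ~ item, value.var="value")
ICC(wide[,-1]) # drop the id column; rows are siblings, columns are items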
Update
Some of the below is new to me, but I enjoyed working it out. I’m really not familiar with the idea of negative intraclass correlation, though I do see on Wikipedia that “early ICC definitions” allowed for a negative correlation with paired data. But as it’s most commonly used now, ICC is understood as the proportion of the total variance that is between-group variance, and this value is always positive. While Wikipedia may not be the most authoritative reference, this summary corresponds with how I’ve always seen ICC used:
An advantage of this ANOVA framework is that different groups can have different numbers of data values, which is difficult to handle using the earlier ICC statistics. Note also that this ICC is always non-negative, allowing it to be interpreted as the proportion of total variance that is “between groups.” This ICC can be generalized to allow for covariate effects, in which case the ICC is interpreted as capturing the within-class similarity of the covariate-adjusted data values.
That said, with data like you’ve given here, the inter-class (i.e., Pearson) correlation between items 1, 2, and 3 could very well be negative. We can model this, but the proportion of variance that is between groups will still be positive.
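To make that concrete, the quantity we’ll compute below is ICC = var_between / (var_between + var_within). Both terms are variance components, hence non-negative, so the ratio can’t be negative.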
# load our data and lme4
library(lme4)
## Loading required package: Matrix
dat<-read.table("http://www.wvbauer.com/fam_sib_item.dat", header=TRUE)
So what percentage of the variance is between families, also accounting for the variance between item-groups? We can use a random-intercepts model like you suggested:
mod<-lmer(yijk ~ (1|family)+(1|item), data=dat)
summary(mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yijk ~ (1 | family) + (1 | item)
## Data: dat
##
## REML criterion at convergence: 4392.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.6832 -0.6316 0.0015 0.6038 3.9801
##
## Random effects:
## Groups Name Variance Std.Dev.
## family (Intercept) 0.3415 0.5843
## item (Intercept) 0.8767 0.9363
## Residual 4.2730 2.0671
## Number of obs: 1008, groups: family, 100; item, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.927 0.548 5.342
We calculate the ICC from the variance components: the variances of the two random intercepts and the residual variance, which VarCorr() returns directly (so there is no need to square anything). The family ICC is the family variance divided by the sum of all three variances.
temp<-as.data.frame(VarCorr(mod))$vcov
# family variance over the total variance
temp.family<-temp[1]/(temp[1]+temp[2]+temp[3])
temp.family
## [1] 0.0622
We can then do the same for the other two variance estimates:
# variance between item-groups
temp.items<-temp[2]/(temp[1]+temp[2]+temp[3])
temp.items
## [1] 0.1597
# variance unexplained by either grouping factor
temp.resid<-temp[3]/(temp[1]+temp[2]+temp[3])
temp.resid
## [1] 0.7782
# clearly then, these will sum to 1
temp.family+temp.items+temp.resid
## [1] 1
These results suggest that only a modest share of the total variance is between families (about 6%) or between item-groups (about 16%); most of it is residual. But, as noted above, the inter-class correlation between items could still be negative. First let’s get our data in a wider format:
# not elegant but does the trick
# (assumes siblings appear in the same order within each item subset)
dat2<-cbind(subset(dat,item==1),subset(dat,item==2)[,1],subset(dat,item==3)[,1])
names(dat2)<-c("item1","family","sibling","item","item2","item3")
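If you’d rather not rely on the row ordering, a sketch with base R’s reshape(), assuming dat has the columns yijk, family, sibling, and item as above:
# reshape to wide by matching on family and sibling, not row order
dat2w<-reshape(dat, idvar=c("family","sibling"), timevar="item", direction="wide")
# rename yijk.1, yijk.2, yijk.3 to item1, item2, item3
names(dat2w)<-sub("yijk.", "item", names(dat2w), fixed=TRUE)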
Now we can model the correlation between, for example, item1 and item3 with a random intercept for family as before. But first, it’s perhaps worth recalling that for a simple linear regression, the square root of the model’s r-squared equals the absolute value of the inter-class correlation coefficient (Pearson’s r) between item1 and item3; here they match exactly because the correlation is positive.
# a simple linear regression
mod2<-lm(item1~item3,data=dat2)
# extract pearson's r
sqrt(summary(mod2)$r.squared)
## [1] 0.6819125
# check this
cor(dat2$item1,dat2$item3)
## [1] 0.6819125
# yep, equal
# now, add random intercept to the model
mod3<-lmer(item1 ~ item3 + (1|family), data=dat2)
summary(mod3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: item1 ~ item3 + (1 | family)
## Data: dat2
##
## REML criterion at convergence: 1188.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3148 -0.5348 -0.0136 0.5724 3.2589
##
## Random effects:
## Groups Name Variance Std.Dev.
## family (Intercept) 0.686 0.8283
## Residual 1.519 1.2323
## Number of obs: 336, groups: family, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.07777 0.15277 -0.509
## item3 0.52337 0.02775 18.863
##
## Correlation of Fixed Effects:
## (Intr)
## item3 -0.699
The relationship between item1 and item3 is positive. But, just to check that we can get a negative correlation here, let’s manipulate our data:
# just going to multiply one column by -1
# to force this cor to be negative
dat2$neg.item3<-dat2$item3*-1
cor(dat2$item1, dat2$neg.item3)
## [1] -0.6819125
# now we have a negative relationship
# replace item3 with this manipulated value
mod4<-lmer(item1 ~ neg.item3 + (1|family), data=dat2)
summary(mod4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: item1 ~ neg.item3 + (1 | family)
## Data: dat2
##
## REML criterion at convergence: 1188.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3148 -0.5348 -0.0136 0.5724 3.2589
##
## Random effects:
## Groups Name Variance Std.Dev.
## family (Intercept) 0.686 0.8283
## Residual 1.519 1.2323
## Number of obs: 336, groups: family, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.07777 0.15277 -0.509
## neg.item3 -0.52337 0.02775 -18.863
##
## Correlation of Fixed Effects:
## (Intr)
## neg.item3 0.699
So yes, the relationship between items can be negative. But if we look at the proportion of variance that’s between families in this relationship, i.e., ICC(family), that number will still be positive. As before:
temp2<-as.data.frame(VarCorr(mod4))$vcov
# family variance over (family + residual) variance
temp2[1]/(temp2[1]+temp2[2])
## [1] 0.311
So for the relationship between item1 and item3, about 31% of the variance is due to variance between families. And, we’ve still allowed for there to be a negative correlation between items.
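If you’d rather not assemble these ratios by hand, the 'performance' package computes this kind of variance-partitioning ICC straight from a merMod fit; a sketch, assuming performance is installed:
# cross-check the by-hand calculations above
library(performance)
icc(mod, by_group=TRUE) # separate ICCs for family and item
icc(mod4)               # family ICC, adjusting for the neg.item3 fixed effect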