
I have a data set with 1002 subjects and 501 families, and I would like to estimate the percentage of variance in a dependent variable explained by "family".

To make it easier, let's say I want to find the percentage of variance explained by "Batch" in the following data set:

Test data

set.seed(1231)
batch <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
yield <- c(34,36,35,68,71,72,75,87,97,72,74,73,79,81,79)
age <- rnorm(15, mean = 20, sd = 1)
bmi <- rnorm(15, mean = 32, sd = 1.2)
mydata <- data.frame(batch, yield, age, bmi)
mydata$batch <- factor(mydata$batch)
str(mydata)

Approach one:

I regressed "yield" against covariates "Age" and "bmi", then extract the residuals. Using the residuals, I estimated the R$^2$ of "Batch"

Approach two:

I could estimate the $R^2$ of "Batch" with a mixed model, using "Batch" as a random effect. (The original data have 501 families, so I want to treat family as a random effect.)

My question is:

I am not able to get a similar $R^2$ for "Batch" from the two approaches above (0.6915 vs. 0.937).

I wonder if anyone could help me here? I have thought about this for two days and still have no ideas. I would be very glad for any help. :-)

R code

# Approach one: extract residuals and estimate R^2 of "Batch"
lmaovresid <- lm(yield ~ age + bmi, data = mydata)
mydata$Yresid <- lmaovresid$residuals
lmaov2 <- lm(Yresid ~ batch, data = mydata)
anova(lmaov2)

# Response: Yresid
# Df Sum Sq Mean Sq F value  Pr(>F)  
# batch      4 2535.9  633.97  5.6048 0.01245 *
# Residuals 10 1131.1  113.11     

# Therefore, R$^2$ of "Batch" = 2535.9/(2535.9+1131.11) = 0.6915

# Approach two: mixed model using "batch" as random effect
library(lme4)
mixed2 <- lmer(yield ~ age + bmi + (1|batch), data = mydata)
summary(mixed2)

# Random effects:
# Groups   Name        Variance Std.Dev.
# batch    (Intercept) 391.95   19.798  
# Residual              31.52    5.614  
# Number of obs: 15, groups:  batch, 5

## So: 
## MS of "Batch" = 31.52 + 15/5*(391.95) = 1207.37
## SS of "Batch" = 1207.37 * 4 = 4829.48
## SS of residual = 31.52 * 10 = 315.2 
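
# Hedged sketch (assuming the mixed2 fit above): the same variance components
# can be read off VarCorr() rather than copied by hand; the factor 3 is the
# number of observations per batch (15/5).
vc <- as.data.frame(VarCorr(mixed2))
v.batch <- vc$vcov[vc$grp == "batch"]      # ~392
v.resid <- vc$vcov[vc$grp == "Residual"]   # ~31.5
MSbatch <- v.resid + 3 * v.batch           # ~1207
SSbatch2 <- MSbatch * 4                    # 4 df for batch
SSresid2 <- v.resid * 10                   # 10 residual df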

anova(mixed2)
# Analysis of Variance Table
# Df Sum Sq Mean Sq F value
# age  1 2.7085  2.7085  0.0859
# bmi  1 1.4369  1.4369  0.0456

## Therefore, R^2 of "Batch" 
## = 4829.48/ (4829.48+315.2+2.7085+1.4369) = 0.937
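
# Continuing the sketch: the fixed-effect sums of squares come from anova(mixed2),
# so the final ratio can be reproduced without manual arithmetic.
fe <- anova(mixed2)
SSfixed <- sum(fe[, "Sum Sq"])
SSbatch2 / (SSbatch2 + SSresid2 + SSfixed)  # ~0.94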
Jake
  • In similar applications I have used Xu R. (2003), *Measuring explained variation in linear mixed effects models*, and the results were easily interpretable and accepted by all parties concerned. Whether or not one should base the assessment of a model's validity on a single number is questionable, though. – usεr11852 Apr 14 '17 at 23:01
  • Thanks. My primary concern is to estimate the $R^2$ of "family". Based on what I know, the two approaches should be somewhat equivalent, and now I tend to believe I missed something in the mixed model approach. – Jake Apr 15 '17 at 14:28
  • There is no $R^2$ of a particular variable. The inclusion or exclusion of a particular variable might (and often does) change the explanatory power of other variables. You can say that it increases the $R^2$ of a model that does not include it by $X$ amount but that is something different. – usεr11852 Apr 15 '17 at 14:47
  • An informative thread is [Calculating R2 in mixed models using Nakagawa & Schielzeth's (2013) R2glmm method](https://stats.stackexchange.com/questions/111150/calculating-r2-in-mixed-models-using-nakagawa-schielzeths-2013-r2glmm-me) – Firebug Apr 16 '17 at 20:06
  • @Firebug: Good addition! (I just +1 Robert's answer, how that one didn't get any votes is beyond me.) – usεr11852 Apr 16 '17 at 20:07
  • @usεr11852 I just put the thread title in the hyperlink, we posted almost at the same time haha – Firebug Apr 16 '17 at 20:20
  • Thank you all (including Placidia) for the useful references and discussion! Based on my understanding, to get the $R^2$ attributable solely to the random effect of a model using the Nakagawa method, it is simply R2c - R2m, correct? – Jake Apr 18 '17 at 13:24

1 Answer


The usual $R^2$ looks at how much of the variation in the data is explained by the model, as opposed to the error term. However, a random effect is not, strictly speaking, part of the model. It is part of the error. So one way to look at this would be to estimate the random effect error and the pure error (residual error) and see how they compare.

The output for a random-effects model generally gives estimates of these. In your output, they are under "Random effects": batch at around 392 and the residual at around 31.5.
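
A minimal sketch of that comparison in R, assuming lme4 is loaded and mixed2 is the fit from the question (the resulting ratio is the intraclass correlation, i.e. the share of the non-fixed-effect variance attributed to batch):

vc <- as.data.frame(VarCorr(mixed2))
v.batch <- vc$vcov[vc$grp == "batch"]     # ~392
v.resid <- vc$vcov[vc$grp == "Residual"]  # ~31.5
v.batch / (v.batch + v.resid)             # ~0.93

If you prefer a packaged $R^2$-style summary, packages such as MuMIn (r.squaredGLMM) implement the Nakagawa and Schielzeth measure mentioned in the comments.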

Placidia
  • Ompph... As you say, it is a "strictly speaking" definition - regarding a random effect as a nuisance parameter is (in my opinion) over-simplifying things. Aside from the Xu paper mentioned above, one could try Nakagawa and Schielzeth (2013), *A general and simple method for obtaining $R^2$ from generalized linear mixed-effects models*, for something more methodologically inclusive. – usεr11852 Apr 16 '17 at 20:06
  • You make a fair point - and that's a nice reference. But I think it's important to remember how different random effects are from fixed effects. The concept rests on latent values (the actual effects) that typically grow in number with the sample size, and they are supposed to be normally distributed. They sort of are nuisance parameters. – Placidia Apr 16 '17 at 20:10
  • I don't disagree; you raise a valid point. I wanted to point out this caveat, though, because one can easily be led to over-simplified assertions using the nuisance-parameter working definition. For instance, going ahead and simply reporting $R^2$ from the OLS model because "all else is error after all" can give a pessimistic estimate of the model's explanatory power (in terms of $R^2$, at least). – usεr11852 Apr 16 '17 at 20:18