Using correlation for "explained variation"

Question

I illustrate the question with dummy data; R code to reproduce the data is given below.

Imagine that I have collected data on how much mail 500 people receive. My survey includes 250 men and 250 women, so $n = 500$. I asked each respondent to categorise and count the mail they received over the course of one week into one of three classes: junk, bills, and personal.

What I want to test is how much of the variation in mail received is explained by each of the three classes. Assuming that all mail received falls into one of the three classes, I would think that I could generate a variance-covariance matrix (using MCMCglmm in R) and then derive correlations. The correlations would then act as estimates of how much variance is explained by each class of mail.

$\rho_{ij} = \frac{\mathrm{cov}_{total, class_i}}{\sqrt{\sigma^2_{total} \cdot \sigma^2_{class_i}}}$

$\text{Percentage variance explained} = 100 \cdot \rho_{ij}$

Here $\rho_{ij}$ is the correlation between the total mail received and the category of interest. For example, I might show that the correlation between total mail received and junk mail received, as defined above, is 0.7, and thus conclude that the amount of junk mail one receives explains 70% of the variation in mail received. Further, I might sum junk and bills into a new class called "businesses", which might explain 85% of the variance if the correlation, as defined above, is 0.85.

Would this be an appropriate approach or is it a terribly flawed idea? I've read that using correlation coefficients as a way to measure "explained variation" can be controversial, but I think in this case it is appropriate because the value total can only be determined by the 3 constituents, thus there are no more variables that can explain variation in the total.

# Clear workspace
rm(list = ls())
set.seed(1123)

# Number of people surveyed
m = 250 # males
f = 250 # females
n = m + f

# Distributions
junk = rpois(n, 3)
personal = rpois(n, 2)
bills = rpois(n, 3)

# Dataframe
mail = data.frame(
  person   = as.factor(1:n),
  sex      = as.factor(rep(c("m", "f"), each = m)),
  junk     = sample(junk, replace = TRUE, size = n),
  bills    = sample(bills, replace = TRUE, size = n),
  personal = sample(personal, replace = TRUE, size = n)
)
mail$total = apply(mail[, 3:5], 1, sum)

# Convert the counts to factors
mail$junk = as.factor(mail$junk)
mail$bills = as.factor(mail$bills)
mail$personal = as.factor(mail$personal)
mail$total = as.factor(mail$total)
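
As a minimal sketch of the calculation described above, the correlations could be derived from a variance-covariance matrix as follows. This assumes the counts are kept numeric and uses the plain sample covariance from `cov()` rather than an MCMCglmm estimate; the names `counts`, `V`, `rho` and `rho_check` are made up for the illustration.

```r
# Sketch: correlation between total mail and each class, derived from
# the variance-covariance matrix. Columns are kept numeric here (an
# assumption; the data frame above stores them as factors).
set.seed(1123)
n <- 500
counts <- data.frame(
  junk     = rpois(n, 3),
  bills    = rpois(n, 3),
  personal = rpois(n, 2)
)
counts$total <- rowSums(counts)

# Sample variance-covariance matrix of the four columns
V <- cov(counts)

# rho = cov(total, class) / sqrt(var(total) * var(class))
classes <- c("junk", "bills", "personal")
rho <- V["total", classes] /
  sqrt(V["total", "total"] * diag(V[classes, classes]))

# The same values come straight from cor()
rho_check <- cor(counts)["total", classes]
print(round(rho, 3))
```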

score 2 · Answer 1 · edited Apr 13 '17 at 12:44

There are already several approaches to computing "variance explained" ($R^2$) for linear mixed models (see the linked discussions and follow their sources), so it would probably be good to review those ideas before devising your own approach. Authors who have done research on this include Xu (2003), Edwards et al. (2008), Nakagawa and Schielzeth (2013), and Snijders and Bosker (1994).

Using correlation as "variance explained" is similar to the general idea of $R^2$, which for linear regression equals the squared correlation (see here). However, this holds for linear regression, not "in general", and it is the correlation squared, not the correlation itself.
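
To make the "correlation squared" point concrete, here is a small sketch with simulated counts. It assumes independent Poisson counts with the same rates as in the question; the variable names are only illustrative. It regresses the total on one class and compares the model's $R^2$ with the correlation and its square.

```r
# Sketch: in a simple linear regression, R^2 equals the squared
# correlation between response and predictor, not the correlation.
set.seed(1123)
n <- 500
junk     <- rpois(n, 3)
bills    <- rpois(n, 3)
personal <- rpois(n, 2)
total    <- junk + bills + personal

fit <- lm(total ~ junk)
r   <- cor(total, junk)          # Pearson correlation
r2  <- summary(fit)$r.squared    # R^2 reported by the regression

print(c(correlation = r, squared = r^2, r.squared = r2))
```

The last two entries agree, while `100 * r` overstates the share. With independent components, the squared correlation is expected to be close to the ratio of the class variance to the total variance (3/8 for junk under these rates).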

The intraclass correlation coefficient (ICC) can also be understood as "variance explained" by random effects:

$$ICC_\alpha = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\gamma + \dots + \sigma^2_\varepsilon}$$

where $\sigma^2_\alpha$ and $\sigma^2_\gamma$ are the variances of the random effects and $\sigma^2_\varepsilon$ is the residual variance (to learn more about the ICC see, for example, Snijders and Bosker, 2012). So here, too, you use variances estimated from the model.
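
The ICC formula above amounts to simple arithmetic on the estimated variance components. A minimal sketch, assuming hypothetical variance estimates (the numbers and the helper `icc()` are made up for illustration and do not come from any fitted model):

```r
# Sketch: ICC as one variance component's share of the total variance.
# The variance components below are hypothetical, not model estimates.
icc <- function(var_components, component) {
  # var_components: named numeric vector of variance estimates,
  # including the residual variance
  var_components[[component]] / sum(var_components)
}

vc <- c(alpha = 2.0, gamma = 0.5, residual = 1.5)
icc_alpha <- icc(vc, "alpha")    # 2.0 / (2.0 + 0.5 + 1.5)
print(icc_alpha)
```

In practice the components would be taken from the fitted mixed model rather than typed in by hand.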

So your idea does not seem totally bad. However, you should compare it with other ideas in this field, and with the papers I linked, which review the pros and cons of different ways of thinking about "variance explained" in LMMs.

Snijders, T.A.B. and Bosker, R.J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage Publishers.

There are serious concerns with looking for a "variance explained" measure for (G)LMMs: none of the generalizations of the classic $R^2$ formulas to the (G)LMM case has all the properties that make $R^2$ so useful. See for example Doug Bates' comments [here](https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q1/003363.html) and [here](https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q4/023007.html). - Livius, Jan 23 '15 at 15:02
Yes, it's not the same as a true $R^2$, but the fact is that there have been multiple attempts to get something similar. The OP asks about "variance explained", and those are a few attempts at it. However, I agree that the critical comments are valid and should be taken into consideration. - Tim, Jan 23 '15 at 16:30
Thanks for the answer. I've spent the weekend reading the Nakagawa/Schielzeth paper and have modified my question to reflect some of my newer struggles. I also work with Schielzeth, so I will email him for some guidance! - rg255, Jan 26 '15 at 15:18
