11

I need to obtain some sort of "average" among a list of variances, but have trouble coming up with a reasonable solution. There is an interesting discussion about the differences among the three Pythagorean means (arithmetic, geometric, and harmonic) in this thread; however, I still don't feel any of them would be a good candidate. Any suggestions?

P.S. Some context - These variances are sample variances from $n$ subjects, each of whom went through the same experiment design with roughly the same sample size $k$. In other words, there are $n$ sampling variances $\sigma_1^2$, $\sigma_2^2$, ..., $\sigma_n^2$, corresponding to those $n$ subjects. A meta analysis has been already performed at the population level. The reason I need to obtain some kind of "average" or "summarized" sample variance is that I want to use it to calculate an index such as ICC after the meta analysis.

P.P.S. To keep the discussion more concrete, let me explain the issue with the following example in R:

library(metafor)
dat <- get(data(dat.konstantopoulos2011))
dat$district <- as.factor(dat$district)
dat$school <- as.factor(dat$school)

In the dataset there is a variance associated with each school's performance score:

str(dat)
Classes ‘escalc’ and 'data.frame':  56 obs. of  6 variables:
 $ district: Factor w/ 11 levels "11","12","18",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ school  : Factor w/ 11 levels "1","2","3","4",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ year    : int  1976 1976 1976 1976 1989 1989 1989 1989 1994 1994 ...
 $ yi      : atomic  -0.18 -0.22 0.23 -0.3 0.13 -0.26 0.19 0.32 0.45 0.38 ...
 $ vi      : num  0.118 0.118 0.144 0.144 0.014 0.014 0.015 0.024 0.023 0.043 ...

Suppose that we perform a meta analysis with a hierarchical or mixed-effects model:

$y_{ij} = a + \alpha_i + \beta_j + \epsilon_{ij}$

where $\alpha_i$ and $\beta_j$ are the random effects for the $i$th school and $j$th district, respectively, and $\epsilon_{ij}$ is the measurement error with a known Gaussian distribution $N(0,v_{ij})$. This model can be analyzed as below:

(fm <- rma.mv(yi, vi, random = list(~1 | district, ~1 | school), data=dat)) 

rendering the following variance estimates for the two variance components:

Multivariate Meta-Analysis Model (k = 56; method: REML)
Variance Components: 

            estim    sqrt  nlvls  fixed    factor
sigma^2.1  0.0814  0.2853     11     no  district
sigma^2.2  0.0010  0.0308     11     no    school

The two variances in the result, sigma^2.1 and sigma^2.2, correspond to the two random-effects variables (district and school).

I would like to compute the ICC for district, and that is why I wanted to obtain a summarized variance in the first place for those individual variances, $v_{ij}$, of the measurement term $\epsilon_{ij}$. Since the total variance is

$Var(y_{ij})= Var(\alpha_i + \beta_j + \epsilon_{ij}) = \sigma_1^2 + \sigma_2^2+v_{ij}$

my original (and simple) approach was to use just the arithmetic mean:

$\frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2+mean(v_{ij})}$

but I am not sure if the arithmetic mean, $mean(v_{ij})$, is appropriate in this context.

bluepole
  • 2,376
  • 3
  • 25
  • 34
  • Can you provide some context? Where did the variances come from, and why do you want to average them? – GeoMatt22 May 06 '17 at 00:11
  • 4
    Context is everything here. Are these theoretical variances (moments of distributions), or sample variances? If they are sample variances, what is the relation between the samples? Do they come from the same population? If yes, do you have available the size of each sample? If the samples do not come from the same population, how do you justify averaging over the variances? – Alecos Papadopoulos May 06 '17 at 00:47
  • @`Alecos Papadopoulos` @GeoMatt22 Per your comments, I just added some information to my original question. Thanks! – bluepole May 06 '17 at 09:23
  • Can you pool all the values then calculate the variance from this big pool? – SmallChess May 06 '17 at 11:26
  • @SmallChess Unfortunately direct pooling is not an option in this case since it's a meta analysis. – bluepole May 06 '17 at 11:37
  • When you say it is a meta-analysis what exactly are you trying to meta-analyse? Do you have $n$ primary studies each of which gives rise to a variance based on $k$ observations within that study? – mdewey May 09 '17 at 14:19
  • @mdeway I just added a little bit more detail about the meta analysis in the original post. – bluepole May 09 '17 at 14:42
  • 2
    Hierarchical modeling is a very flexible answer. This blog post on the eight schools is a good start. http://andrewgelman.com/2014/01/21/everything-need-know-bayesian-statistics-learned-eight-schools/ Gelman et al., *Bayesian Data Analysis* is a great place to get more information. – Sycorax May 09 '17 at 14:56
  • @Sycorax I'm familiar with that blog post. However, my specific goal here is to define an index such as ICC based on the "average" sampling variance among the subjects. – bluepole May 09 '17 at 15:23
  • 2
    Possible duplicate of [How to 'sum' a standard deviation?](https://stats.stackexchange.com/questions/25848/how-to-sum-a-standard-deviation) – Firebug May 09 '17 at 15:31
  • It's not clear to me how the blog post is *not* exactly the answer you need -- sample some variances from the posterior and compute the ICC for each. Then you have a distribution over ICC statistics. – Sycorax May 09 '17 at 15:44
  • 1
    Is this an XY problem? Do you want to know how to average variances... Or do you want to know how to calculate an ICC for a meta-analysis? – Mark White May 11 '17 at 04:18
  • @MarkWhite The latter: I would like to calculate an ICC after the meta analysis. Since each subject has a variance, the question is: how to "average" those $n$ variances so that I could plug it into the ICC formula? – bluepole May 11 '17 at 14:27
  • Does the documentation on the package author's website http://www.metafor-project.org/doku.php/analyses:konstantopoulos2011 not answer your question? Search the page for ICC. It even uses the same example as you are using. – mdewey May 17 '17 at 16:10
  • @mdewey Thanks for the information! I knew that website before, but the ICC formula there does not seem to take the total variance into consideration: $Var(y_{ij}) = Var(\alpha_i + \beta_j + \epsilon_{ij}) = \sigma_1^2 + \sigma_2^2+v_{ij}$ – bluepole May 17 '17 at 16:55
  • 1
    In that case does this https://stats.stackexchange.com/questions/187197/computing-heterogeneity-assigned-to-random-factors-in-meta-analysis post help? – mdewey May 17 '17 at 17:43
  • @mdewey Yes, that's something I should look into further. Thanks a lot for the pointer! – bluepole May 17 '17 at 19:06

1 Answers1

8

Expanding the comments that you got, answer for the question in your title is already given in How to 'sum' a standard deviation? thread, and reads as follows: to get average standard deviation, first take average of variances and then take square root of it.

At face value this approach is valid, but it ignores the hierarchical nature of your data. Similar example is discussed in chapter 5 of Bayesian Data Analysis by Andrew Gelman et al (see also here), who show that it is actually wiser to use hierarchical models that rely on pooled estimates. In your case you have $n \times k$ observations, for $n$ subjects in $k$ treatments and I guess it can be assumed that there is some kind of similarity between results obtained by each subject and between each treatment. This already suggests a hierarchical model with crossed upper-level effects for treatments and for subjects. By using such model you would account for both sources of variation.

Notice that modern formulations of ICC in fact define it in terms of mixed-effects models of the kind as described above, so employing such model solves multiple problems for you and it is often the recommended approach to meta-analysis (but notice that ICC can be misleading).


Regarding your edit, if your model is

$$ y_{ij} = a + \alpha_i + \beta_j + \epsilon_{ij} $$

then $\alpha_i \sim \mathcal{N}(\mu_\alpha, \sigma^2_\alpha)$, $\beta_j \sim \mathcal{N}(\mu_\beta, \sigma^2_\beta)$ and $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2_\epsilon)$, so your ICC is

$$ \mathrm{ICC}_\alpha = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\epsilon^2} $$

The mean of errors does not come into the equation at any point. What comes to the equation is variance of each of the random effects $\alpha,\beta$ and global "noise" $\epsilon$. The idea is to estimate the share of variance taken by $\alpha$, i.e. how much of the total variance does it account for. This is how ICC was defined by its creator Ronald A. Fisher (1966) in Statistical Methods for Research Workers:

(...) the intraclass correlation will be merely the fraction of the total variance due to that cause which observations in the same class have in common.

So the numerator in the ICC formula is the variance of the effect of interest and the denominator is the total variance. Notice that mean of variances has nothing to do with total variance (sum of variances), so unless I misunderstand something, I can't see why the mean is of your interest in here.

Tim
  • 108,699
  • 20
  • 212
  • 390
  • I really appreciate the answer and all the comments above! I just added another postscript in the original post to further clarify the issue. I have to admit that I'm not so familiar with the Bayesian approach. If the issue can be better characterized under the Bayesian paradigm, please elaborate a little bit more with the example dataset I just presented in the postscript. Thanks! – bluepole May 12 '17 at 17:41
  • @bluepole You do not need a Bayesian model. Traditional mixed-effects model would work just fine. Bayesian models are generally more flexible for such problems. – Tim May 12 '17 at 19:00
  • So, for the added example dataset in my original post, do you think that the arithmetic mean is reasonable in the context? – bluepole May 12 '17 at 19:11
  • One thing that is misstated in your addendum is that $\epsilon_{ij}$ follows $N(0, \sigma_{ij}^2)$, not $N(0, \sigma_\epsilon^2)$, where $\sigma_{ij}^2$ is known. So, I don't see how your $\sigma_\epsilon^2$ is estimated. And my original question remains. – bluepole May 16 '17 at 09:24
  • In my description I've only mentioned one model with the assumption $\epsilon_{ij} \sim N(0,\sigma_{ij}^2)$, where $\sigma_{ij}^2$ is known. Could you elaborate a little bit more as to how $\sum_i\sigma_{ij}^2/\sum_{ij} \sigma_{ij}^2$ is related to the ICC formula? Thanks! – bluepole May 16 '17 at 09:58
  • What do you mean by "overdisposed"? I've already included some R code in my description to solve the system. – bluepole May 16 '17 at 10:06
  • @bluepole I am afraid I don't fully understand your questions (you actually asked few different ones after the edits). I answered your initial question and the PS. However your PS was not totally clear since, after re-reading it, I don't understand what kind of ICC do you want to estimate. Your PPS makes it even more unclear and needs from your reader to be familiar with metafor package (I am not). Nonetheless, the model you suggest in metafor code seems to be the one that I described. The model specification you describe in your formulas seem to be mixing fixed effects model with distinct ... – Tim May 17 '17 at 07:21
  • ...variances, with mixed-effects model. The default that you use (struct="CS") estimates model with single variance component (see the docs), i.e. the model that I described. It is not clear however how and why do you want to mix the output from the model with calculated by hand variance component and if/why this should be valid. Nonetheless, the answer still applies to the general questions you asked. As about the edits, it is unclear what (and why) do you mean. – Tim May 17 '17 at 07:28
  • Thanks for the follow-up! I just added a little bit clarification in the PPS regarding the total variance, and hope that explains a little better about my question. – bluepole May 17 '17 at 15:48
  • @bluepole as I said, I do not fully understand what do you mean by your edit. The model that you describe returns all the variance components that are part of it, so if you include in your calculation some arbitrary, calculated by hard variance component, then you are counting something twice. I can't see why do you want to do your manual calculations. Either you use mixed-effects model and use it's estimates for calculating ICC, or not. – Tim May 17 '17 at 18:44