
At a conference I overheard the following statement:

100 measurements for 5 subjects provide much less information than 5 measurements for 100 subjects.

It's kind of obvious that this is true, but I was wondering how one could prove it mathematically... I think a linear mixed model could be used. However, I don't know much about the math used for estimating them (I just run lme4 for LMMs and brms for GLMMs :). Could you show me an example where this is true? I'd prefer an answer with some formulas rather than just some code in R. Feel free to assume a simple setting, such as a linear mixed model with normally distributed random intercepts and slopes.

PS A math-based answer which doesn't involve LMMs would be fine too. I thought of LMMs because they seemed to me the natural tool to explain why fewer measurements from more subjects are better than more measurements from fewer subjects, but I may well be wrong.

amoeba
DeltaIV
  • +1. I guess the simplest setting would be to consider a task of estimating population mean $\mu$ where each subject has their own mean $a \sim \mathcal N(\mu, \sigma_a^2)$ and each measurement of this subject is distributed as $x \sim \mathcal N(a, \sigma^2)$. If we take $n$ measurements from each of the $m$ subjects, then what's the optimal way to set $n$ and $m$ given constant product $nm=N$? – amoeba Jul 30 '17 at 18:34
  • "Optimal" in the sense of minimizing the variance of the sample mean of the $N$ acquired datapoints. – amoeba Jul 30 '17 at 18:55
  • If I understand correctly, your model has two variances $\sigma_a^2$ and $\sigma^2$, and one population mean $\mu$ for a total of 3 parameters. Is that correct? Then I definitely agree that this is the simplest setting, and I'd love to see an answer. – DeltaIV Jul 30 '17 at 19:22
  • Yes. But for your question we don't need to care about how to estimate the variances; your question (i.e. the quote in your question) is I believe only about estimating the global mean $\mu$ and it seems obvious that the best estimator is given by the grand mean $\bar x$ of all $N=nm$ points in the sample. The question then is: given $\mu$, $\sigma^2$, $\sigma^2_a$, $n$ and $m$, what is the variance of $\bar x$? If we know that, we will be able to minimize it with respect to $n$ given the $nm=N$ constraint. – amoeba Jul 30 '17 at 19:35
  • @amoeba actually, I think it could be also interesting to see how/if the answer changes when we consider the (MLE) estimators for $\sigma_a^2$ & $\sigma^2$. In other words, which parameters of the model are better estimated by increasing the number of measurements per subject, and which by increasing the number of subjects, when keeping the total budget fixed? Intuitively the answer is obvious, but maybe it's not so simple to prove. Anyway, I'm ok with keeping it simple and considering only the variance of the grand mean. – DeltaIV Jul 30 '17 at 19:50
  • I don't know how to derive any of that, but I agree that it seems obvious: to estimate error variance it would be best to have all $N$ measurements from one single subject; and to estimate subject variance it would (probably?) be best to have $N$ different subjects with 1 measurement each. It's not so clear about the mean though, but my intuition tells me that having $N$ subjects with 1 measurement each would be best too. I wonder if that's true... – amoeba Jul 30 '17 at 20:01
  • Maybe something like that: The variance of sample per-subject means should be $\sigma^2_a + \sigma^2/n$, where the first term is the subject variance and the second is the variance of the estimate of each subject's mean. Then the variance of over-subjects mean (i.e. grand mean) will be $$(\sigma^2_a + \sigma^2/n)/m = \sigma^2_a/m + \sigma^2/(nm) = \sigma^2_a/m + \sigma^2/N = \sigma^2_a/m + \mathrm{const},$$ which is minimized when $m=N$. – amoeba Jul 30 '17 at 20:30
  • Looks ok! We're assuming all samples are iid in this model, right? This is consistent with the fact that we assumed both covariance matrices to be diagonal. – DeltaIV Jul 31 '17 at 07:11

1 Answer


The short answer is that your conjecture is true if and only if there is a positive intra-class correlation in the data. Empirically speaking, most clustered datasets show a positive intra-class correlation most of the time, which means that in practice your conjecture is usually true. But if the intra-class correlation is 0, then the two cases you mentioned are equally informative. And if the intra-class correlation is negative, then it's actually less informative to take fewer measurements on more subjects; we would actually prefer (as far as reducing the variance of the parameter estimate is concerned) to take all our measurements on a single subject.

Statistically there are two perspectives from which we can think about this: a random-effects (or mixed) model, which you mention in your question, or a marginal model, which ends up being a bit more informative here.

Random-effects (mixed) model

Say we have a set of $n$ subjects from whom we've taken $m$ measurements each. Then a simple random-effects model of the $j$th measurement from the $i$th subject might be $$ y_{ij} = \beta + u_i + e_{ij}, $$ where $\beta$ is the fixed intercept, $u_i$ is the random subject effect (with variance $\sigma^2_u$), $e_{ij}$ is the observation-level error term (with variance $\sigma^2_e$), and the latter two random terms are independent.

In this model $\beta$ represents the population mean, and with a balanced dataset (i.e., an equal number of measurements from each subject), our best estimate is simply the sample mean. So if we take "more information" to mean a smaller variance for this estimate, then basically we want to know how the variance of the sample mean depends on $n$ and $m$. With a bit of algebra we can work out that $$ \begin{aligned} \text{var}(\frac{1}{nm}\sum_i\sum_jy_{ij}) &= \text{var}(\frac{1}{nm}\sum_i\sum_j\beta + u_i + e_{ij}) \\ &= \frac{1}{n^2m^2}\text{var}(\sum_i\sum_ju_i + \sum_i\sum_je_{ij}) \\ &= \frac{1}{n^2m^2}\Big(m^2\sum_i\text{var}(u_i) + \sum_i\sum_j\text{var}(e_{ij})\Big) \\ &= \frac{1}{n^2m^2}(nm^2\sigma^2_u + nm\sigma^2_e) \\ &= \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm}. \end{aligned} $$ Examining this expression, we can see that whenever there is any subject variance (i.e., $\sigma^2_u>0$), increasing the number of subjects ($n$) will make both of these terms smaller, while increasing the number of measurements per subject ($m$) will only make the second term smaller. (For a practical implication of this for designing multi-site replication projects, see this blog post I wrote a while ago.)
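If you want to sanity-check this formula by simulation, here's a minimal Python sketch (the parameter values $n=20$, $m=5$, $\sigma_u=2$, $\sigma_e=3$ are arbitrary illustrative choices, not from any real dataset):

```python
import random
import statistics

def grand_mean(n, m, sigma_u, sigma_e, rng):
    """Simulate one dataset from y_ij = beta + u_i + e_ij (beta = 0 w.l.o.g.)
    and return the grand mean of all n*m observations."""
    total = 0.0
    for _ in range(n):                       # n subjects
        u = rng.gauss(0.0, sigma_u)          # subject effect u_i
        for _ in range(m):                   # m measurements per subject
            total += u + rng.gauss(0.0, sigma_e)
    return total / (n * m)

rng = random.Random(1)
n, m, sigma_u, sigma_e = 20, 5, 2.0, 3.0
draws = [grand_mean(n, m, sigma_u, sigma_e, rng) for _ in range(20000)]
empirical = statistics.variance(draws)
theoretical = sigma_u**2 / n + sigma_e**2 / (n * m)  # 0.2 + 0.09 = 0.29
print(empirical, theoretical)
```

With 20,000 simulated datasets the empirical variance should land within a percent or two of the theoretical value $\sigma^2_u/n + \sigma^2_e/(nm) = 0.29$.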

Now you wanted to know what happens when we increase or decrease $m$ or $n$ while holding constant the total number of observations. So for that we consider $nm$ to be a constant, so that the whole variance expression just looks like $$ \frac{\sigma^2_u}{n} + \text{constant}, $$ which is as small as possible when $n$ is as large as possible (up to a maximum of $n=nm$, in which case $m=1$, meaning we take a single measurement from each subject).
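To see this numerically, we can evaluate the variance expression for each way of splitting a fixed budget of $N = nm = 100$ observations (a Python sketch; the variance components are arbitrary illustrative values):

```python
# Variance of the grand mean for each allocation (n subjects, m = N/n each)
# with N = n*m fixed, using var = sigma_u^2/n + sigma_e^2/N derived above.
sigma_u2, sigma_e2, N = 4.0, 9.0, 100  # illustrative values
variances = {n: sigma_u2 / n + sigma_e2 / N for n in (1, 2, 5, 10, 20, 50, 100)}
for n, v in variances.items():
    print(f"n = {n:3d} subjects, m = {N // n:3d} measurements each: var = {v:.3f}")
```

The variance decreases monotonically in $n$, hitting its minimum at $n = N$, $m = 1$, exactly as the formula says.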

My short answer referred to the intra-class correlation, so where does that fit in? In this simple random-effects model the intra-class correlation is $$ \rho = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e} $$ (sketch of a derivation here). So we can write the variance equation above as $$ \text{var}(\frac{1}{nm}\sum_i\sum_jy_{ij}) = \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm} = \Big(\frac{\rho}{n} + \frac{1-\rho}{nm}\Big)(\sigma^2_u+\sigma^2_e) $$ This doesn't really add any insight to what we already saw above, but it does make us wonder: since the intra-class correlation is a bona fide correlation coefficient, and correlation coefficients can be negative, what would happen (and what would it mean) if the intra-class correlation were negative?
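The algebraic equivalence of the two forms of the variance is easy to confirm numerically (a quick Python check with arbitrary values):

```python
# Check that sigma_u^2/n + sigma_e^2/(nm) equals
# (rho/n + (1 - rho)/(nm)) * (sigma_u^2 + sigma_e^2) with rho the ICC.
sigma_u2, sigma_e2, n, m = 4.0, 9.0, 20, 5  # illustrative values
rho = sigma_u2 / (sigma_u2 + sigma_e2)
lhs = sigma_u2 / n + sigma_e2 / (n * m)
rhs = (rho / n + (1 - rho) / (n * m)) * (sigma_u2 + sigma_e2)
print(lhs, rhs)  # the two expressions agree
```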

In the context of the random-effects model, a negative intra-class correlation doesn't really make sense, because it implies that the subject variance $\sigma^2_u$ is somehow negative (as we can see from the $\rho$ equation above, and as explained here and here)... but variances can't be negative! But this doesn't mean that the concept of a negative intra-class correlation doesn't make sense; it just means that the random-effects model doesn't have any way to express this concept, which is a failure of the model, not of the concept. To express this concept adequately we need to consider the marginal model.

Marginal model

For this same dataset we could consider a so-called marginal model of $y_{ij}$, $$ y_{ij} = \beta + e^*_{ij}, $$ where basically we've pushed the random subject effect $u_i$ from before into the error term $e_{ij}$ so that we have $e^*_{ij} = u_i + e_{ij}$. In the random-effects model we considered the two random terms $u_i$ and $e_{ij}$ to be i.i.d., but in the marginal model we instead consider $e^*_{ij}$ to follow a block-diagonal covariance matrix $\textbf{C}$ like $$\textbf{C}= \sigma^2\begin{bmatrix} \textbf{R} & 0& \cdots & 0\\ 0& \textbf{R} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0& 0& \cdots &\textbf{R}\\ \end{bmatrix}, \textbf{R}= \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots &1\\ \end{bmatrix} $$ In words, this means that under the marginal model we simply consider $\rho$ to be the expected correlation between two $e^*$s from the same subject (we assume the correlation across subjects is 0). When $\rho$ is positive, two observations drawn from the same subject tend to be more similar (closer together), on average, than two observations drawn randomly from the dataset while ignoring the clustering due to subjects. When $\rho$ is negative, two observations drawn from the same subject tend to be less similar (further apart), on average, than two observations drawn completely at random. (More information about this interpretation in the question/answers here.)

So now when we look at the equation for the variance of the sample mean under the marginal model, we have $$ \begin{aligned} \text{var}(\frac{1}{nm}\sum_i\sum_jy_{ij}) &= \text{var}(\frac{1}{nm}\sum_i\sum_j\beta + e^*_{ij}) \\ &= \frac{1}{n^2m^2}\text{var}(\sum_i\sum_je^*_{ij}) \\ &= \frac{1}{n^2m^2}\Big(n\big(m\sigma^2 + (m^2-m)\rho\sigma^2\big)\Big) \\ &= \frac{\sigma^2\big(1+(m-1)\rho\big)}{nm} \\ &= \Big(\frac{\rho}{n}+\frac{1-\rho}{nm}\Big)\sigma^2, \end{aligned} $$ which is the same variance expression we derived above for the random-effects model, just with $\sigma^2_e+\sigma^2_u=\sigma^2$, which is consistent with our note above that $e^*_{ij} = u_i + e_{ij}$. The advantage of this (statistically equivalent) perspective is that here we can think about a negative intra-class correlation without needing to invoke any weird concepts like a negative subject variance. Negative intra-class correlations just fit naturally in this framework.
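We can double-check this derivation by building the block-diagonal matrix $\textbf{C}$ explicitly and using the fact that the variance of the grand mean is $\frac{1}{(nm)^2}\mathbf{1}^\top\textbf{C}\,\mathbf{1}$, i.e. the sum of all entries of $\textbf{C}$ divided by $(nm)^2$. A Python sketch with arbitrary parameter values:

```python
def marginal_cov(n, m, rho, sigma2=1.0):
    """Block-diagonal marginal covariance C: sigma2 on the diagonal,
    rho*sigma2 within a subject's m-by-m block, 0 across subjects."""
    size = n * m
    return [[sigma2 if i == j else (rho * sigma2 if i // m == j // m else 0.0)
             for j in range(size)] for i in range(size)]

n, m, rho, sigma2 = 10, 4, 0.25, 2.0  # illustrative values
C = marginal_cov(n, m, rho, sigma2)
direct = sum(map(sum, C)) / (n * m) ** 2           # 1'C1 / (nm)^2
formula = sigma2 * (1 + (m - 1) * rho) / (n * m)   # the derived expression
print(direct, formula)  # the two agree
```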

(BTW, just a quick aside to point out that the second-to-last line of the derivation above implies that we must have $\rho \ge -1/(m-1)$, or else the whole equation is negative, but variances can't be negative! So there is a lower bound on the intra-class correlation that depends on how many measurements we have per cluster. For $m=2$ (i.e., we measure each subject twice), the intra-class correlation can go all the way down to $\rho=-1$; for $m=3$ it can only go down to $\rho=-1/2$; and so on. Fun fact!)
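A concrete way to see the extreme case $\rho = -1/(m-1)$: if we take $m$ i.i.d. draws per subject and center them within each subject, the centered values have exactly this compound-symmetric correlation, and since $1+(m-1)\rho = 0$, the grand mean then has zero variance (it recovers $\beta$ exactly). A Python sketch of this construction (my own illustration, with arbitrary values):

```python
import random
import statistics

def centered_errors(m, rng):
    """Within-subject errors at the extreme rho = -1/(m-1): center
    i.i.d. draws so they sum to zero within the subject."""
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]
    zbar = sum(z) / m
    return [zj - zbar for zj in z]

rng = random.Random(0)
n, m, beta = 10, 4, 5.0
grand_means = []
for _ in range(1000):
    obs = []
    for _ in range(n):
        obs.extend(beta + e for e in centered_errors(m, rng))
    grand_means.append(sum(obs) / len(obs))
print(statistics.variance(grand_means))  # zero up to floating-point error
```

Every simulated grand mean equals $\beta$ exactly, because each subject's errors cancel within the subject.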

So finally, once again considering the total number of observations $nm$ to be a constant, we see that the second-to-last line of the derivation above just looks like $$ \big(1+(m-1)\rho\big) \times \text{positive constant}. $$ So when $\rho>0$, having $m$ as small as possible (so that we take fewer measurements of more subjects; in the limit, 1 measurement of each subject) makes the variance of the estimate as small as possible. But when $\rho<0$, we actually want $m$ to be as large as possible (so that, in the limit, we take all $nm$ measurements from a single subject) in order to make the variance as small as possible. And when $\rho=0$, the variance of the estimate is just a constant, so our allocation of $m$ and $n$ doesn't matter.
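Putting it together, a small Python sketch that finds the best $m$ for a few values of $\rho$ under a fixed budget $N = 100$ (illustrative values; note $\rho = -0.01$ respects the bound $\rho \ge -1/(m-1)$ for every $m$ considered):

```python
def var_grand_mean(N, m, rho, sigma2=1.0):
    """Variance of the grand mean, sigma^2 * (1 + (m-1)*rho) / N, N = n*m fixed."""
    return sigma2 * (1 + (m - 1) * rho) / N

N = 100
for rho in (0.3, 0.0, -0.01):
    vs = {m: var_grand_mean(N, m, rho) for m in (1, 2, 5, 10, 100)}
    best_m = min(vs, key=vs.get)  # for rho = 0 every allocation ties
    print(f"rho = {rho:+.2f}: best m = {best_m}, variances = {vs}")
```

For $\rho>0$ the minimizer is $m=1$ (one measurement per subject), for $\rho<0$ it is $m=N$ (all measurements on one subject), and for $\rho=0$ all allocations give the same variance.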

Jake Westfall
  • +1. Great answer. I have to admit that the second part, about $\rho<0$, is quite unintuitive: even with a huge (or infinite) total number $nm$ of observations the best we can do is to allocate all observations to one single subject, meaning that the standard error of the mean will be $\sigma_u$ and it's not possible *in principle* to reduce it any further. This is just so weird! True $\beta$ remains unknowable, whatever resources one puts into measuring it. Is this interpretation correct? – amoeba Aug 01 '17 at 08:28
  • Ah, no. The above is not correct because as $m$ increases to infinity, $\rho$ cannot stay negative and has to approach zero (corresponding to zero subject variance). Hmm. This negative correlation is a funny thing: it's not really a parameter of the generative model because it's constrained by the sample size (whereas one would normally expect a generative model to be able to generate any number of observations, whatever the parameters are). I am not quite sure what is the proper way to think about it. – amoeba Aug 01 '17 at 09:17
  • Woah impressive answer! (+1). I will need some time to read through it and understand it. Right now I have two questions: 1. you note that with a balanced dataset, our best estimate of the population mean is simply the sample mean. What's the best estimate if the dataset is unbalanced? I can ask it as a separate question if you prefer, but even just a link to a good answer, or to some material, would be great. 2. you say that the mixed effects model is not able to represent negative intra-class correlation. This is true because we assumed the covariance matrix of the random effects to ... – DeltaIV Aug 01 '17 at 09:28
  • ...diagonal. But suppose we assume a more general mixed effect model, where $\Sigma$ is not diagonal. For example, it's estimated from data as done [here](https://vuorre.netlify.com/post/2017/correlated-psychological-variables-uncertainty-and-bayesian-estimation/#update) (even if the link actually considers GLMMs, which are a needless complication here). Then, would the mixed effect model still be unable to reproduce negative intraclass correlation or not? – DeltaIV Aug 01 '17 at 09:33
  • @DeltaIV What is "the covariance matrix of the random effects" in this case? In the mixed model written by Jake above, there is only one random effect and so there is no "covariance matrix" really, but just one number: $\sigma^2_u$. What $\Sigma$ are you referring to? – amoeba Aug 01 '17 at 09:37
  • @amoeba you're right, the covariance matrix would only make sense if I had at least a covariate in the model. No covariates $\Rightarrow$ only one random effect $\Rightarrow$ no $\Sigma$ to speak of. I feel embarrassed now :P Probably I just don't have an intuitive understanding of what this intra-class correlation is. BTW, if you have pointers for the other part of my comment (best estimate of sample mean when dataset is unbalanced), I'm all ears :) – DeltaIV Aug 01 '17 at 09:46
  • @DeltaIV If the dataset is unbalanced across subjects, then there is a question of how to weight the individual subject means when combining them to form the estimate. Taking the simple mean would be equivalent to weighting each subject directly in proportion to their number of measurements. Which is not a crazy idea, but it turns out that the MLE of $\beta$ is a weighted average of the subject means where the weights depend on the numbers of measurements *and* the estimated subject variance, $\hat{\sigma^2_u}$. I can't remember the exact expression, but if you Google around you can find it. – Jake Westfall Aug 01 '17 at 14:51
  • @amoeba I think you can perhaps still think of $\rho$ as a parameter of the generative model, because I'm guessing (but haven't checked) that the constraint on $\rho$ is equivalent to the constraint that $\textbf{C}$ must be positive semidefinite. So all it really says is that you have to use a valid covariance matrix. But it's like the value $m$ is sort of "hard-coded" as a parameter of the generative model, unlike $n$ which we can generate to be as high as we want for any given set of parameters. This might all be semantics, but it seems to me to still make sense to call $\rho$ a parameter – Jake Westfall Aug 01 '17 at 15:05
  • @JakeWestfall I tried to Google for an estimator of the sample mean when the dataset is unbalanced, but couldn't find anything. All the "unbalanced" links I found refer to Machine Learning two-class classification tasks where one class is much more common than the other in the training sample. I will ask a new question here, I hope you'll take a look at it ;-) – DeltaIV Aug 02 '17 at 09:08
  • @DeltaIV Well, the general principle is https://en.wikipedia.org/wiki/Inverse-variance_weighting, and the variance of each subject's sample mean is given by $\sigma^2_u + \sigma^2_e/m_i$ (that's why Jake wrote above that the weights have to depend on the estimate of between-subject variance). The estimate of within-subject variance is given by the variance of the pooled within-subject deviations, the estimate of between-subject variance is the variance of subjects' means, and using all that one can compute the weights. (But I am not sure if this is 100% equivalent to what lmer will do.) – amoeba Aug 02 '17 at 10:35
  • Jake, yes, it's exactly this hard-coding of $m$ that was bothering me. If this is "sample size" then it cannot be a parameter of the underlying system. My current thinking is that negative $\rho$ should actually indicate that there is another within-subject factor that is ignored/unknown to us. E.g. it could be pre & post of some intervention and the difference between them is so large that the measurements are negatively correlated. But this would mean that $m$ is not really a sample size, but the number of levels of this unknown factor, and *that* can certainly be hard coded... – amoeba Aug 02 '17 at 10:51
  • +5 btw, I put up a bounty here. – amoeba Aug 02 '17 at 10:52
  • @amoeba that's ok, I don't need it to be exactly equivalent to what `lmer` does. I just wanted to be able to give a "manual" estimate without having to blindly rely on R. This question comes out surprisingly often in an industrial context. – DeltaIV Aug 02 '17 at 11:42
  • @amoeba $m$ as the number of levels for a missing, unobserved factor is a very interesting idea. It's interesting to think about how the fully observed model (including the missing factor) could be parameterized so that removing that factor would result in this model with the negative intra-class correlation – Jake Westfall Aug 02 '17 at 13:49