Context: My problem relates to estimating effect sizes, such as Cohen's d, when looking at a subset of the population defined by a cut-off threshold. This effect size is the difference in two population means divided by the (assumed equal) population standard deviation.

Suppose there is a sample from a population with a variable $Y$ whose "true" values $Y_{i0}$ are measured with error at two time points, $t_1$ and $t_2$, giving measurements $Y_{i1} = Y_{i0} + \epsilon_{i1}$ and $Y_{i2} = Y_{i0} + \epsilon_{i2}$. At time $t_1$ we define a subset $J$ of the population by "$i \in J$ if $Y_{i1} > a$" for some fixed $a$. The objective is to estimate the variance of the subset at $t_2$, $V[Y_{j2} \mid j \in J]$ (or equivalently, the variance of $Y$ in the subset measured at any time other than $t_1$). We cannot use the subset's estimated variance at $t_1$, because the variance at $t_2$ will be larger.

Example code showing that the standard deviation of the subset at $t_2$ is greater than its standard deviation at $t_1$:

set.seed(1)
N <- 1000
Y0 <- rnorm(N,mean=0,sd=1)
Y1 <- Y0 + rnorm(N,mean=0,sd=0.5)
Y2 <- Y0 + rnorm(N,mean=0,sd=0.5)
indx <- Y1 > 1
sd(Y1[indx])
# [1] 0.6007802
sd(Y2[indx])
# [1] 0.8145581

Does this phenomenon, the variance of a thresholded subset increasing upon re-measurement, have a name? Can anyone share any references to help understand it either generally or in the specific context of effect sizes?

David Luke Thiessen

1 Answer

This is a kind of regression toward the mean, applied in this specific case to the variance or standard deviation. Regression toward the mean arises when subjects are selected because of a very high or very low measured value: their subsequent measurements tend to be closer to the population average.
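
A sketch of why the variance behaves this way, assuming (as in the simulated example, and not necessarily in real data) that $Y_{i0}$ and the errors are independent and normal. The symbols $\mu$, $\sigma_0^2$, $\sigma_\epsilon^2$, $\sigma_1^2$, $\rho$ and $\eta_i$ are introduced here for illustration. Writing $\sigma_1^2 = \sigma_0^2 + \sigma_\epsilon^2$ for the variance of a single measurement,

$$
Y_{i2} - \mu = \rho\,(Y_{i1} - \mu) + \eta_i, \qquad \rho = \frac{\sigma_0^2}{\sigma_1^2}, \qquad V[\eta_i] = \sigma_1^2\,(1 - \rho^2),
$$

with $\eta_i$ independent of $Y_{i1}$. Conditioning on $Y_{i1} > a$ then gives

$$
E[Y_{i2} \mid Y_{i1} > a] = \mu + \rho\,\big(E[Y_{i1} \mid Y_{i1} > a] - \mu\big), \qquad
V[Y_{i2} \mid Y_{i1} > a] = \rho^2\,V[Y_{i1} \mid Y_{i1} > a] + (1 - \rho^2)\,\sigma_1^2 .
$$

Under these assumptions, the subset variance at $t_2$ is a weighted average of the subset variance at $t_1$ (which truncation makes smaller than $\sigma_1^2$) and the population variance $\sigma_1^2$, so it lies between the two.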

Regression toward the mean can be observed, for instance, if you pick the best students in a class and follow their results over a period of time (there are many other possible scenarios!). At T1, you choose the best students based on a measure Y1, for instance with indx <- Y1 > 1; then at T2, the statistics of that group should move toward the population parameters (in this example: $\mu = 0$, $\sigma = \sqrt{1.25}$):

set.seed(1)
N <- 1000
Y0 <- rnorm(N,mean=0,sd=1)
Y1 <- Y0 + rnorm(N,mean=0,sd=0.5)
Y2 <- Y0 + rnorm(N,mean=0,sd=0.5)
indx <- Y1 > 1
mean(Y1[indx])
#1.685
sd(Y1[indx])
#0.6007802
mean(Y2[indx])
#1.357769 # a decrease toward the population mean = 0
sd(Y2[indx])
#0.8145581 # an increase toward the population standard deviation = sqrt(1.25)

As expected.
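
To put rough numbers on "as expected", here is a minimal sketch that computes the subset moments in closed form and checks them against a larger simulation. It assumes the same model as the example (normal true scores, independent normal errors). The helper `subset_moments` and the lower-case variable names are hypothetical, chosen so they do not overwrite the objects defined above. The formulas used are the standard truncated-normal moments for $Y_1$, combined with the regression slope of $Y_2$ on $Y_1$, $\rho = \sigma_0^2 / (\sigma_0^2 + \sigma_\epsilon^2)$.

```r
# Hypothetical helper (not from the original post): closed-form moments of the
# subset selected by Y1 > a, assuming Y0 ~ N(mu, sd0^2) and independent
# N(0, sd_err^2) measurement errors at each time point.
subset_moments <- function(a, mu = 0, sd0 = 1, sd_err = 0.5) {
  var1   <- sd0^2 + sd_err^2                        # variance of one measurement
  rho    <- sd0^2 / var1                            # regression slope of Y2 on Y1
  alpha  <- (a - mu) / sqrt(var1)                   # standardised cut-off
  lambda <- dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)  # inverse Mills ratio
  mean1  <- mu + sqrt(var1) * lambda                # E[Y1 | Y1 > a]
  v1     <- var1 * (1 - lambda * (lambda - alpha))  # V[Y1 | Y1 > a]
  c(mean_t1 = mean1,
    sd_t1   = sqrt(v1),
    mean_t2 = mu + rho * (mean1 - mu),              # mean regresses toward mu
    sd_t2   = sqrt(rho^2 * v1 + (1 - rho^2) * var1)) # SD regresses toward sqrt(var1)
}

theory <- subset_moments(a = 1)

# Larger simulation of the same model, to check the formulas empirically.
set.seed(1)
n_big <- 1e6
y0  <- rnorm(n_big, mean = 0, sd = 1)
y1  <- y0 + rnorm(n_big, mean = 0, sd = 0.5)
y2  <- y0 + rnorm(n_big, mean = 0, sd = 0.5)
sel <- y1 > 1
simulated <- c(mean_t1 = mean(y1[sel]), sd_t1 = sd(y1[sel]),
               mean_t2 = mean(y2[sel]), sd_t2 = sd(y2[sel]))

# Side-by-side comparison of closed-form and simulated values.
print(round(rbind(theory, simulated), 3))
```

If only sample data are available, the same relation can serve as a plug-in estimate, provided $\rho$ (the reliability of the measurement) is known or can be estimated. With $\rho = 0.8$ and the $N = 1000$ values above, $\sqrt{0.64 \times 0.6007802^2 + 0.36 \times 1.25} \approx 0.825$, close to the observed 0.8145581.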

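To make the matter more obvious, we could select again at T2 among the students already selected at T1, and look at a third measure T3, like

Y3 <- Y0 + rnorm(N,mean=0,sd=0.5)
indx2 <- indx & (Y2 > 1) # full-length logical index: above the cut-off at both T1 and T2
mean(Y2[indx2])
mean(Y3[indx2]) # expected to be lower than mean(Y2[indx2])
sd(Y2[indx2])
sd(Y3[indx2]) # expected to be higher than sd(Y2[indx2])

The T3 values regress again relative to the T2 values used for the second selection: their mean should move back toward the population mean and their standard deviation back toward the population standard deviation.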

I don't have a specific reference for regression toward the mean in the context of effect sizes, but the phenomenon is covered extensively in many resources, and I do not see why it would behave differently in that context. The Wikipedia page on regression toward the mean can be very helpful and has many references. Stigler (2002) has an interesting and very accessible chapter on the topic.

Stigler, S. M. (2002). Statistics on the table: The history of statistical concepts and methods. Harvard University Press.

POC