
Let us consider an example where we have a number of runners and an estimate of speed in mph for each runner. The estimate for each runner may be based on an equal or unequal number of independent, identically conducted experiments in which that runner is timed over a fixed distance. We assume each runner's time is a random draw from some runner-specific distribution. If we wish to identify the runner with the highest expected speed, we might select the runner in our dataset who has the highest average speed.

Even though that may be a fine procedure for finding the best runner, that runner's average speed, calculated from the data, will be a biased estimate of their expected speed, because we chose that runner precisely for having the highest average speed in the dataset.

My question is: is it possible to adjust our estimate of the selected runner's expected speed by accounting for the procedure by which we selected them? E.g., if we select the runner with the highest average speed, the adjusted estimate should be lower than their calculated average, and vice versa. I'm interested in a solution even if it requires introducing additional assumptions.
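To make the bias concrete, here is a small simulation (in R; the setup and all parameters are arbitrary choices for illustration). Every runner has the same true expected speed, yet the sample mean of whichever runner we select as fastest is systematically too high:

set.seed(1)
n_sims = 5000; n_runners = 10; n_runs = 5
true_speed = 8  # every runner's true expected speed, in mph
sel_mean = replicate(n_sims, {
  # each runner's average speed over n_runs noisy timed runs
  means = rowMeans(matrix(rnorm(n_runners*n_runs, mean=true_speed, sd=1),
                          nrow=n_runners))
  max(means)  # the sample mean of the runner we would select
})
mean(sel_mean) - true_speed  # positive: selection inflates the estimate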

Ryan Volpi
  • Measures of central tendency go beyond the classic mean: median, geometric mean, harmonic mean, median of medians. Each brings its own "physics", and each can yield a different insight. – EngrStudent Mar 19 '21 at 15:49
  • @EngrStudent I am aware of other measures of central tendency, but not sure how they would solve the problem. My question could just as well be posed in terms of finding the runner with the maximum median instead of expected value. – Ryan Volpi Mar 19 '21 at 15:59
  • In optimal control systems there are infinitely many measures of goodness, and calling something "optimal" lets you extract a measure of goodness from it. Bias, in the statistical sense, means an offset of central tendency between the actual and the ideal. There are infinitely many ways to contrive that; some are worthless engineering exercises, and some are quite useful. If you say the ideal is the sample mean, then the difference between the sample mean and the sample mean is zero: you cooked the books and achieved infinite perfection. You need a defining contrast, like alternate-form bootstraps. – EngrStudent Mar 19 '21 at 18:46

1 Answer


Your example reminds me of the one I used in my answer to Algorithms for automatic model selection. If you could reasonably assume the latent ability and the conditional values are normal, you could fit an intercept-only linear mixed model (LMM) and use the best linear unbiased predictions (BLUPs) from it. BLUPs are predicted from the estimated latent distribution and the individual's own data. The more data you have from an individual, and the less noise there is in their data, the closer the BLUP will be to their own mean (see my answer to: Why do the estimated values from a Best Linear Unbiased Predictor (BLUP) differ from a Best Linear Unbiased Estimator (BLUE)?).

Using the code from my example in the linked answer, here is a concrete illustration of what I mean (coded in R). Note that since everyone has the same number of runs and the same amount of noise, the ordering of the means and of the BLUPs is the same. It's just that the BLUPs are more accurate, by shrinking (er, regressing) the values toward the grand mean.

set.seed(59)
# latent ability for 30 runners; each run's time is a noisy linear
# function of that ability
intrinsic_ability = runif(30, min=9, max=10)
time  = 31 - 2*intrinsic_ability + rnorm(30, mean=0, sd=.5)  # 1st timed run
time2 = 31 - 2*intrinsic_ability + rnorm(30, mean=0, sd=.5)  # 2nd timed run
id    = paste0("id", 1:30)

library(lme4)
# intercept-only LMM with a random intercept for each runner
m = lmer(c(time, time2) ~ 1 + (1|rep(id, times=2)))
d = data.frame(time1=time, time2=time2, mean=rowMeans(cbind(time, time2)),
               blup=predict(m)[1:30], true=31-2*intrinsic_ability)
head(d)
#   time1 time2 mean blup true
# 1  13.3  13.2 13.2 13.0 12.9
# 2  11.5  11.9 11.7 11.7 11.9
# 3  10.6  11.4 11.0 11.1 11.3
# 4  11.3  10.6 10.9 11.1 11.6
# 5  11.6  11.4 11.5 11.6 11.8
# 6  11.0  10.5 10.7 10.9 11.4
which.min(d$mean)  # [1] 6  <- the runner we would select as fastest (lowest time)
# squared error of the raw per-runner means vs. the truth
summary(apply(d, 1, function(x){  (x[3]-x[5])**2  }))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   0.000   0.011   0.054   0.097   0.101   0.479 
# squared error of the BLUPs vs. the truth: note the smaller errors
summary(apply(d, 1, function(x){  (x[4]-x[5])**2  }))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  0.0001  0.0056  0.0198  0.0610  0.0705  0.2541 
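
Tying this back to the question: by the output above, runner 6 is the one we would select as fastest (lowest mean time), and their raw mean (10.7) overstates how fast they are, relative to the truth (11.4), by more than their BLUP (10.9) does. The shrinkage moves the selected runner's estimate in exactly the direction the question asks for:

i = which.min(d$mean)              # runner 6, per the output above
d[i, c("mean", "blup", "true")]    # 10.7 vs 10.9 vs 11.4 (rounded as above)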
gung - Reinstate Monica