
This site has a couple of nice questions and answers dealing with Simpson's paradox and random effects (e.g. Simpson's Paradox & Random Effects, Understanding Simpson's paradox with random effects). However, those questions all deal with regimes where a random-effects model (approximately) recovers the true within-group slopes. This is not always the case. Under some conditions, the random-effects model yields essentially the same inference as the naive model that ignores grouping altogether, and only by incorporating fixed effects for the groups can the within-group slope be recovered. Moreover, in this regime the predictive performance of the random-effects model (which recovers the overall across-group slope) can be much better than that of the fixed-effects model (which recovers the true within-group slope).

I have a little R simulation below illustrating this issue.
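To be concrete, in the notation of the code below the species-mean covariate values $\mu_j$ are drawn from a standard normal, and the data are generated as

$$x_{ij} \sim \mathcal{N}(\mu_j,\ \sigma_x^2), \qquad y_{ij} \sim \mathcal{N}\big(\beta_a \mu_j + \beta_w (x_{ij} - \mu_j),\ \sigma_y^2\big),$$

where $\beta_a$ is slope_across_species, $\beta_w$ is slope_within_species, and $\sigma_x$ and $\sigma_y$ are sd_x and sd_y.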

This behavior is not mysterious to me--I understand reasonably well why it happens. My question is whether anyone knows of a nice, citable reference for the fact that if we really need our estimate of the within-group slope to reflect only within-group variation, random effects are not always a sufficient solution. I'm vaguely aware of some treatment of this in the econometrics literature based on panel data (and often using models fit using GLS), but would prefer a reference that is conversant with the maximum likelihood and/or Bayesian literature that's more familiar in the biological sciences.

Here's R code to play with (and a couple of extra asides below):

# This script simulates an example of Simpson's paradox.
# We then model the data twice: once with a random intercept for "species", and once with fixed intercepts for "species" (i.e. each species gets its own independently estimated intercept).
# The random-intercept model consistently estimates a slope for "covariate" that is biased away from "slope_within_species"
# and towards "slope_across_species". When sd_x is small compared to both sd_y and 1, this bias can become severe.

# Since I'm an ecologist, the grouping factor is called "species"

# for illustrative purposes, I suggest running the following parameter combinations:
# slope_across_species    slope_within_species    sd_x    sd_y    result
# 1                       0                       1       1       random effect yields a weakly significant positive slope; fixed effect yields a non-significant slope
# 1                       0                       0.5     1       random effect yields a tremendously significant positive slope; fixed effect yields a non-significant slope
# 1                       -1                      0.5     1       both models yield a tremendously significant negative slope
# 1                       -1                      0.3     1       random effect yields a tremendously significant positive slope; fixed effect yields a tremendously significant negative slope


library(lmerTest)
# simulation parameters: 
# All of these parameters are measured on a scale relative to the expected standard deviation of the species-mean covariate values, which is fixed at
# approximately one (it is stochastic in the simulation, but its sampling distribution is that of sd(rnorm(100))).
slope_across_species <- 1  # interspecific variation
slope_within_species <- 0  # intraspecific variation
sd_x <- 1 # within-species standard deviation in covariate values
sd_y <- 1 # residual standard deviation in observed data values (conditioned on the species and the covariate value)

# simulate data
set.seed(13)
species <- as.factor(rep(1:100, 10))  # 100 species with 10 observations each
covariate_mean_by_species <- rep(rnorm(100), 10)   # The mean covariate value associated with each species
covariate <- rnorm(1000, mean = covariate_mean_by_species, sd = sd_x)  # The covariate values associated with each observation
thedata <- rnorm(1000, mean = slope_across_species * covariate_mean_by_species + slope_within_species * (covariate - covariate_mean_by_species), sd = sd_y)
df <- data.frame(species=species, covariate=covariate, thedata=thedata)

# plot the data
plot(thedata ~ covariate, data = df)   # plot all points
points(thedata ~ covariate, data = df[as.integer(df$species) < 10, ], col = df$species[as.integer(df$species) < 10], pch = 10)  # highlight points from a handful of species

# run the models.
summary(lmer(thedata ~ covariate + (1|species), data = df))

summary(lm(thedata ~ covariate + species, data = df))
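
To compare the two slope estimates side by side (rather than scanning the full summaries), a minimal sketch like the following works; the object names m_ranef and m_fixef are my own:

m_ranef <- lmer(thedata ~ covariate + (1|species), data = df)
m_fixef <- lm(thedata ~ covariate + species, data = df)

# the "covariate" slope from each model
c(random_intercept = unname(fixef(m_ranef)["covariate"]),
  fixed_intercepts = unname(coef(m_fixef)["covariate"]))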

Aside 1: Importantly, model comparison can prefer the random-effects model to the fixed-effects model even when the two estimated slopes have opposite signs. This is expected: if the dominant signal in the data is the between-group variation explained by the covariate, then a model that estimates the between-group slope will achieve great predictive performance, while a model that estimates the within-group slope and then handles the between-group variation with a hundred fixed intercepts will not.
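
For instance, one quick (admittedly crude, since the effective degrees of freedom of the random effects are debatable) version of such a comparison is AIC, after refitting the mixed model by maximum likelihood so that the likelihoods are comparable. This is a sketch, not part of the original script:

# refit the mixed model by ML (rather than REML) before comparing AIC
m_ranef_ml <- lmer(thedata ~ covariate + (1|species), data = df, REML = FALSE)
m_fixef <- lm(thedata ~ covariate + species, data = df)
AIC(m_ranef_ml, m_fixef)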

Aside 2: I've confirmed the random-effects model results with Stan (you can too, by replacing the call to lmer with a call to rstanarm::stan_lmer; a sketch follows). The lme4 results at the parameter combinations suggested above are robust.
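
For reference, the substitution is a one-liner (rstanarm defaults for priors and chains; fit_stan is my own name):

library(rstanarm)
fit_stan <- stan_lmer(thedata ~ covariate + (1|species), data = df)
summary(fit_stan, pars = "covariate")

But...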

Aside 3: If we start from the parameters slope_across_species = 1; slope_within_species = -1; sd_x = 0.3; sd_y = 1 and incrementally increase sd_x, we see an extremely abrupt transition near sd_x = 0.32, where the slope estimated by the random-effects model swings from positive (and highly significant) to negative (and highly significant) over almost no change in the data. In a fairly tight region from about sd_x = 0.315 to 0.35, Stan also has some trouble sampling, suggesting that the posterior geometry becomes pathological near this transition point.
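
Here is a minimal sketch of that sweep, holding the other parameters at the values above (slope_across_species = 1, slope_within_species = -1, sd_y = 1); the helper slope_at_sd_x is my own:

# re-simulate with a given sd_x and return the random-intercept slope estimate
slope_at_sd_x <- function(sd_x, seed = 13) {
  set.seed(seed)
  mu <- rep(rnorm(100), 10)                # species-mean covariate values
  x  <- rnorm(1000, mean = mu, sd = sd_x)  # observed covariate values
  y  <- rnorm(1000, mean = 1 * mu - 1 * (x - mu), sd = 1)
  d  <- data.frame(species = as.factor(rep(1:100, 10)), covariate = x, thedata = y)
  unname(fixef(lmer(thedata ~ covariate + (1|species), data = d))["covariate"])
}
sapply(seq(0.25, 0.40, by = 0.01), slope_at_sd_x)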


