I find myself in the position of advocating for a linear mixed-effects model to estimate a trend to a non-technical audience. The subject of the regression model is a somewhat contentious topic, and my client and their adversaries have differing motives and desires. The goal of the model is to determine whether the response variable is declining over time and to estimate the decline. My client claims that there is a decline; their adversaries claim that there is not. There are two audiences and two overarching concerns. The first concern is convincing my client that a more "complicated" model (meaning a linear mixed-effects model) is both more appropriate given the study design and will provide more consistent estimates of the trend. The second concern is convincing their adversaries that an alternative, "simpler" model may be providing misleading results.
I will be as specific as I can about the study design without revealing details of the study. The observational unit was sampled from the environment once per year, at around the same time of year, at several locations within the sampling frame. There is some imbalance in the design in that the number of observational units in the sample differed between the earlier and later years of the study. We are most interested in the trend over time, but we know there are two continuous variables which may affect the observed response and which we are not strictly interested in; I will refer to these as nuisance variables. The nuisance variables are not correlated with each other, nor with time (I have checked this in the actual data), but there is evidence of an interaction between them. Additionally, it seems reasonable to account for the fact that responses in different years may differ simply due to environmental variability, and that responses within the same year are probably correlated with each other. It also seems reasonable to account for any differences between the sampling locations within the sampling frame and for the potential correlation between samples collected at the same location. All of this points to a linear mixed-effects model. Specifically, the model I am considering is:
$$ Y_{ijk} = \beta_0 + \beta_1 X_{1_j} + \beta_2 X_{2_{ijk}} + \beta_3 X_{3_{ijk}} + \beta_4 X_{2_{ijk}} X_{3_{ijk}} + \gamma_j + \gamma_k + \epsilon_{ijk} $$
where:
- $Y_{ijk}$ is the response for the $i^{th}$ observational unit in the $j^{th}$ year sampled at the $k^{th}$ sampling location
- $\beta_0$ is the overall intercept
- $\beta_1$ is the effect of year, $X_{1_j}$, on the response
- $\beta_2$ is the effect of the first nuisance variable, $X_{2_{ijk}}$, on the response
- $\beta_3$ is the effect of the second nuisance variable, $X_{3_{ijk}}$, on the response
- $\beta_4$ is the effect of the interaction between the nuisance variables on the response
- $\gamma_j$ is the random effect of year on the response; independent and identically distributed (i.i.d.) $N(0, \sigma_j)$
- $\gamma_k$ is the random effect of sampling location on the response; i.i.d. $N(0, \sigma_k)$
- $\epsilon_{ijk}$ is the residual variation of the response; i.i.d. $N(0, \sigma)$
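For concreteness, this is roughly how I would specify that model in R with lme4 (just a sketch, assuming a data frame with the column names used in the simulation code at the bottom of this post):

library(lme4)

# Fixed effects: centered year trend, the two nuisance variables and their
# interaction; random intercepts for year and for sampling location
fit_mixed <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
                    (1 | fYear) + (1 | station),
                  data = Sim)
summary(fit_mixed)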
Compare this to the simpler model: $$ Y_{ijk} = \beta_0 + \beta_1 X_{1_j} + \epsilon_{ijk} $$ where $\epsilon_{ijk}$ now "absorbs" all of the variation in the response accounted for by the fixed and random effects of the model above. I anticipate that the client's adversaries will use this model, perhaps fit to subsets of the data from each location, to claim that the trend is inconsistent or even non-existent at certain locations.
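For comparison, the simpler model, and the per-location fits I expect the adversaries to favour, would look something like this (same placeholder column names as above):

# Trend-only model fit to all of the data at once
fit_simple <- lm(response ~ ScaleYear, data = Sim)

# Per-location version: a separate trend for each sampling location
fits_by_station <- lapply(split(Sim, Sim$station),
                          function(d) lm(response ~ ScaleYear, data = d))
lapply(fits_by_station, function(m) confint(m)["ScaleYear", ])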
One obvious approach to resolving some of these concerns is to simulate data, fit the alternative models, and compare their frequentist properties (e.g. confidence interval coverage, power). I am pursuing this. However, I'm concerned that such an approach will be too technical for these audiences and will provoke the reaction "I don't understand it, so you must be trying to pull one over on me."
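For example, the comparison could be organized roughly as below. This assumes the simulation code at the bottom of this post is wrapped in a hypothetical sim_once() function that returns one fresh data set per call; the true slope in my simulation is -0.025, and coverage is tallied across replicates:

library(lme4)

true_slope <- -0.025
n_reps <- 500
covered <- matrix(NA, n_reps, 2, dimnames = list(NULL, c("mixed", "simple")))

for (r in seq_len(n_reps)) {
  dat <- sim_once()  # hypothetical wrapper around the simulation code below

  # Mixed model: Wald interval for the year trend from the fixed effects
  m1 <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
               (1 | fYear) + (1 | station), data = dat)
  est <- fixef(m1)["ScaleYear"]
  se  <- sqrt(diag(vcov(m1)))["ScaleYear"]
  covered[r, "mixed"] <- (est - 1.96 * se) <= true_slope &
    true_slope <= (est + 1.96 * se)

  # Simple model: ordinary least-squares interval for the same trend
  m2 <- lm(response ~ ScaleYear, data = dat)
  ci <- confint(m2)["ScaleYear", ]
  covered[r, "simple"] <- ci[1] <= true_slope & true_slope <= ci[2]
}

colMeans(covered)  # empirical confidence coverage for each model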
Here's the question: I think what I'm looking for is an appeal to authority. My default setting as a statistician is to use all of the data at once (i.e. don't subset without good reason) and to account for all the variation that is reasonable given a scientific understanding of the problem. But this is based on my graduate training rather than any specific references, and I realize it may not be enough to point to my degree. I need a good, intuitive explanation for why not explicitly accounting for these potential sources of variation will lead to a worse model, and sterling references for that contention. I'm thinking of explaining this as analogous to attenuation in an errors-in-variables situation, but I'm not sure that is a totally appropriate analogy. Is the more "complicated" model better for detecting and estimating the trend? If so, why? And who says so?
Bonus: R code for simulating the data:
library(dplyr)

set.seed(1)  # for reproducibility

# Design: 20 observational units per year in 1998-2003 and 30 per year in
# 2004-2009, split across five sampling locations ("stations" A-E); station A
# gets extra samples in the later years, so the design is unbalanced.
Sim_Data <- data.frame(
  Year = c(rep(1998:2003, each = 20),
           rep(2004:2009, each = 30)),
  station = c(rep(rep(c("A", "B", "C", "D", "E"), each = 4), 6),
              rep(c(rep("A", 10), rep(c("B", "C", "D", "E"), each = 5)), 6))
)

Sim <- Sim_Data %>%
  mutate(ScaleYear = Year - 2003,   # centre year so the intercept sits mid-study
         fYear = factor(Year),      # year as a factor for the random effect
         Nuisance1 = rnorm(n(), mean = -2.45, sd = 0.3),
         Nuisance2 = rgamma(n(), shape = 2.4, rate = 3.5),
         resid = rnorm(n(), mean = 0, sd = 0.3)) %>%
  group_by(Year) %>%
  mutate(year_re = rnorm(1, sd = 0.1)) %>%       # one random intercept per year
  ungroup() %>%
  group_by(station) %>%
  mutate(stat_int_re = rnorm(1, sd = 0.1)) %>%   # one random intercept per station
  ungroup() %>%
  mutate(response = 2 - 0.025 * ScaleYear +
           0.6 * Nuisance1 + 0.875 * Nuisance2 + 0.3 * Nuisance1 * Nuisance2 +
           stat_int_re + year_re + resid)
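To close the loop, fitting both candidate models to this one simulated data set and pulling out the estimated year trend would go something like this (the mixed-model interval here is a Wald interval from lme4's confint method):

library(lme4)

# Mixed model matching the data-generating process vs. the trend-only model
fit_mixed  <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
                     (1 | fYear) + (1 | station), data = Sim)
fit_simple <- lm(response ~ ScaleYear, data = Sim)

fixef(fit_mixed)["ScaleYear"]                       # mixed-model trend estimate
coef(fit_simple)["ScaleYear"]                       # simple-model trend estimate
confint(fit_mixed, method = "Wald")["ScaleYear", ]  # true slope is -0.025
confint(fit_simple)["ScaleYear", ]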