I find myself in the position of advocating for a linear mixed-effects model to estimate a trend to a non-technical audience. The subject of the regression model is a somewhat contentious topic, and my client and their adversaries have differing motives and desires. The goal of the model is to determine whether the response variable is declining over time and to estimate the decline. My client claims that there is a decline; their adversaries claim that there is not. There are two audiences and two overarching concerns. The first concern is convincing my client that a more "complicated" model (meaning a linear mixed-effects model) is both more appropriate given the study design and will provide more consistent estimates of the trend. The second concern is convincing their adversaries that an alternative, "simpler" model may be providing misleading results.
I will be as specific as I can about the study design without revealing details of the study. The observational unit was sampled from the environment once per year, at around the same time of year, at several locations within the sampling frame. There is some imbalance in the design in that the number of observational units in the sample differed between the earlier and later years of the study. We are most interested in the trend over time, but we know there are two continuous variables which may affect the observed response and which we are not strictly interested in; I will refer to these as nuisance variables. The nuisance variables are not correlated with each other, nor with time (I have checked this in the actual data), but there is evidence of an interaction between them. Additionally, it seems reasonable to account for the fact that responses in different years may differ simply due to environmental variability, and that responses within the same year are probably correlated with each other. It also seems reasonable to account for any differences between the sampling locations within the sampling frame and for the potential correlation between samples collected at the same location. All of this points to a linear mixed-effects model. Specifically, the model I am considering is:
$$ Y_{ijk} = \beta_0 + \beta_1 X_{1_j} + \beta_2 X_{2_{ijk}} + \beta_3 X_{3_{ijk}} + \beta_4 X_{2_{ijk}} X_{3_{ijk}} + \gamma_j + \gamma_k + \epsilon_{ijk} $$
where:
- $Y_{ijk}$ is the response for the $i^{th}$ observational unit in the $j^{th}$ year sampled at the $k^{th}$ sampling location
- $\beta_0$ is the overall intercept
- $\beta_1$ is the effect of year, $X_{1_j}$, on the response
- $\beta_2$ is the effect of the first nuisance variable, $X_{2_{ijk}}$, on the response
- $\beta_3$ is the effect of the second nuisance variable, $X_{3_{ijk}}$, on the response
- $\beta_4$ is the effect of the interaction between the nuisance variables on the response
- $\gamma_j$ is the random effect of year on the response; independent and identically distributed (i.i.d.) $N(0, \sigma_j)$
- $\gamma_k$ is the random effect of sampling location on the response; i.i.d. $N(0, \sigma_k)$
- $\epsilon_{ijk}$ is the residual variation of the response; i.i.d. $N(0, \sigma)$
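For concreteness, this is roughly how I would specify that model in R with lme4 (just a sketch, assuming a data frame with the column names used in the simulation code at the bottom of this post):

library(lme4)

# Fixed effects: centered year trend, the two nuisance variables and their
# interaction; random intercepts for year and for sampling location
fit_mixed <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
                    (1 | fYear) + (1 | station),
                  data = Sim)
summary(fit_mixed)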
Compare this to the simpler model: $$ Y_{ijk} = \beta_0 + \beta_1 X_{1_j} + \epsilon_{ijk} $$ where $\epsilon_{ijk}$ now "absorbs" all of the variation in the response accounted for by the fixed and random effects of the model above. I anticipate that the client's adversaries will use this model, perhaps fit to subsets of the data from each location, to claim that the trend is inconsistent or even non-existent at certain locations.
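For comparison, the simpler model, and the per-location fits I expect the adversaries to favour, would look something like this (same placeholder column names as above):

# Trend-only model fit to all of the data at once
fit_simple <- lm(response ~ ScaleYear, data = Sim)

# Per-location version: a separate trend for each sampling location
fits_by_station <- lapply(split(Sim, Sim$station),
                          function(d) lm(response ~ ScaleYear, data = d))
lapply(fits_by_station, function(m) confint(m)["ScaleYear", ])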
One obvious approach to resolving some of these concerns is to simulate data, fit the alternative models, and compare their frequentist properties (e.g. confidence interval coverage, power). I am pursuing this. However, I'm concerned that such an approach will be too technical for these audiences and will provoke the reaction "I don't understand it, so you must be trying to pull one over on me."
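For example, the comparison could be organized roughly as below. This assumes the simulation code at the bottom of this post is wrapped in a hypothetical sim_once() function that returns one fresh data set per call; the true slope in my simulation is -0.025, and coverage is tallied across replicates:

library(lme4)

true_slope <- -0.025
n_reps <- 500
covered <- matrix(NA, n_reps, 2, dimnames = list(NULL, c("mixed", "simple")))

for (r in seq_len(n_reps)) {
  dat <- sim_once()  # hypothetical wrapper around the simulation code below

  # Mixed model: Wald interval for the year trend from the fixed effects
  m1 <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
               (1 | fYear) + (1 | station), data = dat)
  est <- fixef(m1)["ScaleYear"]
  se  <- sqrt(diag(vcov(m1)))["ScaleYear"]
  covered[r, "mixed"] <- (est - 1.96 * se) <= true_slope &
    true_slope <= (est + 1.96 * se)

  # Simple model: ordinary least-squares interval for the same trend
  m2 <- lm(response ~ ScaleYear, data = dat)
  ci <- confint(m2)["ScaleYear", ]
  covered[r, "simple"] <- ci[1] <= true_slope & true_slope <= ci[2]
}

colMeans(covered)  # empirical confidence coverage for each model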
Here's the question: I think what I'm looking for is an appeal to authority. My default setting as a statistician is to use all of the data at once (i.e. don't subset without good reason) and to account for all the variation that is reasonable given a scientific understanding of the problem. But this is based on my graduate training rather than any specific references, and I realize it may not be enough to point to my degree. I need a good, intuitive explanation for why not explicitly accounting for these potential sources of variation will lead to a worse model, and sterling references for that contention. I'm thinking of explaining this as analogous to attenuation in an errors-in-variables situation, but I'm not sure that is a totally appropriate analogy. Is the more "complicated" model better for detecting and estimating the trend? If so, why? And who says so?
Bonus: R code for simulating the data:
library(dplyr)

set.seed(1)  # for reproducibility

# Design: 20 observational units per year in 1998-2003 and 30 per year in
# 2004-2009, split across five sampling locations ("stations" A-E); station A
# gets extra samples in the later years, so the design is unbalanced.
Sim_Data <- data.frame(
  Year = c(rep(1998:2003, each = 20),
           rep(2004:2009, each = 30)),
  station = c(rep(rep(c("A", "B", "C", "D", "E"), each = 4), 6),
              rep(c(rep("A", 10), rep(c("B", "C", "D", "E"), each = 5)), 6))
)

Sim <- Sim_Data %>%
  mutate(ScaleYear = Year - 2003,   # centre year so the intercept sits mid-study
         fYear = factor(Year),      # year as a factor for the random effect
         Nuisance1 = rnorm(n(), mean = -2.45, sd = 0.3),
         Nuisance2 = rgamma(n(), shape = 2.4, rate = 3.5),
         resid = rnorm(n(), mean = 0, sd = 0.3)) %>%
  group_by(Year) %>%
  mutate(year_re = rnorm(1, sd = 0.1)) %>%       # one random intercept per year
  ungroup() %>%
  group_by(station) %>%
  mutate(stat_int_re = rnorm(1, sd = 0.1)) %>%   # one random intercept per station
  ungroup() %>%
  mutate(response = 2 - 0.025 * ScaleYear +
           0.6 * Nuisance1 + 0.875 * Nuisance2 + 0.3 * Nuisance1 * Nuisance2 +
           stat_int_re + year_re + resid)
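To close the loop, fitting both candidate models to this one simulated data set and pulling out the estimated year trend would go something like this (the mixed-model interval here is a Wald interval from lme4's confint method):

library(lme4)

# Mixed model matching the data-generating process vs. the trend-only model
fit_mixed  <- lmer(response ~ ScaleYear + Nuisance1 * Nuisance2 +
                     (1 | fYear) + (1 | station), data = Sim)
fit_simple <- lm(response ~ ScaleYear, data = Sim)

fixef(fit_mixed)["ScaleYear"]                       # mixed-model trend estimate
coef(fit_simple)["ScaleYear"]                       # simple-model trend estimate
confint(fit_mixed, method = "Wald")["ScaleYear", ]  # true slope is -0.025
confint(fit_simple)["ScaleYear", ]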