This is going to look like a duplicate of a common question--something along the lines of "Should/can I remove insignificant regression terms?" That type of question has been asked--and answered--here and here and many other places besides.
My question is more specific and, I hope, grapples more with the nuance at the core of the issue.
Let's say I have a simple linear regression of the form:

Y ~ X1 + X2 + Z1 + Z2

In this case, $X_1$ and $X_2$ are binary factor variables (i.e., "treated" versus "untreated") representing two different treatments whose potential impacts on some outcome variable $Y$ I have concrete hypotheses about. $Z_1$ and $Z_2$, meanwhile, are continuous measures of how intense the corresponding treatment was for each unit studied. For example, if $X_2$ is "leaf removal," $Z_2$ might be the exact number of leaves removed from each plant, which may have varied for some unavoidable reason.
Now, I'm not interested in knowing whether $Y$ dropped more for plants that had 40 leaves removed versus 38 leaves removed ($Z_2$), though I can believe that could happen. That within-treatment variability is not my focus. I'm interested simply in whether leaf removal in general resulted in a significant change in $Y$ ($X_2$). I include $Z_1$ and $Z_2$ only to account for a possible additional source of variance in $Y$ that I know about and would like to control for, if it matters, but that I am not particularly interested in.
Now, when I run my regression, I find that $Z_1$ and $Z_2$ are not remotely significant, nor are their effect sizes large. $X_1$ and $X_2$ aren't significant either. But if I remove $Z_1$ and $Z_2$, suddenly $X_1$ and $X_2$ become significant. So, obviously, the inclusion of those two terms affects my conclusions.
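To illustrate one mechanism by which this kind of flip can happen, here is a minimal simulation sketch (all variable names, sample sizes, and effect values are hypothetical, not taken from my actual data): when the intensity measure $Z_1$ is nonzero only for treated units, it is nearly collinear with the dummy $X_1$, which inflates the standard errors of the treatment dummies in the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Hypothetical simulated experiment mirroring the question's setup:
X1 = rng.integers(0, 2, n)                      # binary treatment 1
X2 = rng.integers(0, 2, n)                      # binary treatment 2
Z1 = X1 * rng.normal(10, 2, n)                  # intensity of treatment 1 (0 if untreated)
Z2 = X2 * rng.normal(40, 3, n)                  # e.g. leaves removed (0 if untreated)
Y = 5 + 2 * X1 + 2 * X2 + rng.normal(0, 2, n)   # Z1/Z2 have no effect beyond X1/X2

def ols_t_stats(y, cols):
    """Fit OLS with an intercept; return t-statistics for each predictor."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta[1:] / se[1:]                    # drop the intercept's t-stat

t_full = ols_t_stats(Y, [X1, X2, Z1, Z2])       # Y ~ X1 + X2 + Z1 + Z2
t_reduced = ols_t_stats(Y, [X1, X2])            # Y ~ X1 + X2
print("full model t-stats    (X1, X2, Z1, Z2):", t_full)
print("reduced model t-stats (X1, X2):        ", t_reduced)
```

Because each $Z$ is roughly a scaled copy of its $X$ here, the full model splits the treatment effect between the two near-collinear columns and the t-statistics for the dummies tend to shrink; dropping $Z_1$ and $Z_2$ concentrates the effect back on the dummies.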
Now, here are my thoughts:

- $Z_1$ and $Z_2$ are apparently not accounting for as much variance in the data as I thought they would.
- Maybe my estimates of their effects are too imprecise to be useful and may actually be reducing my power to detect a treatment effect?
- Removing them increases the simplicity of my model, making it easier to describe.
- $Z_1$ and $Z_2$ are not central to my hypotheses, nor are they particularly germane in the literature available on this question. That is, other, similar studies don't generally include them, and no one will think it is strange if I don't either.
All of that said, do I really have a reasonable justification to remove $Z_1$ and $Z_2$ from this model for outcome $Y$ when, for outcome $K$ in a different model of mine, they would be predictive and thus should be left in?
My conclusion, so far, is that I either have to conclude they are not useful predictors and remove them from all my models, or conclude that they may be useful predictors at least some of the time and leave them in all my models. Otherwise, I feel as though I am somehow having my cake and eating it too. Is this a fair conclusion to reach? Or are there additional ethical and practical considerations here that I am unaware of?