
This is going to look like a duplicate of a common question--something along the lines of "Should/can I remove insignificant regression terms?" That type of question has been asked--and answered--here and here and many other places besides.

My question is more specific and, I hope, grapples more with the nuance at the core of the issue.

Let's say I have a simple linear regression of the form:

`Y ~ X1 + X2 + Z1 + Z2`

In this case, X1 and X2 are binary factor variables (i.e., "treated" versus "untreated") representing two different treatments whose potential impacts on some outcome variable Y I have concrete hypotheses about. Z1 and Z2, meanwhile, are continuous measures of how intense the corresponding treatment was for each unit studied. For example, if X2 is "leaf removal," Z2 might be the exact number of leaves removed from each plant, which may have varied for some unavoidable reason.

Now, I'm not interested in knowing whether Y dropped more for plants that had 40 leaves removed versus 38 leaves removed (Z2), though I could believe that might happen. That within-treatment variability is not my focus. I'm interested in knowing simply whether leaf removal in general resulted in a significant change in Y (X2). I include Z1 and Z2 only to account for a possible additional source of variance in Y that I know about and would like to control for, if it matters, but that I am not particularly interested in.

Now, when I run my regression, I find that Z1 and Z2 are not remotely significant, nor are their effect sizes large. X1 and X2 aren't significant either. But if I remove Z1 and Z2, suddenly X1 and X2 become significant. So, obviously, the inclusion of those two terms affects my conclusions.
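To make this situation concrete, here is a toy simulation (entirely hypothetical numbers, fit with plain numpy least squares rather than any particular stats package) of a design like the one described, where the intensity Z is zero exactly when the unit is untreated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical design: binary treatments, with intensities that are
# zero for untreated units and roughly 30-50 for treated ones
X1 = rng.integers(0, 2, n).astype(float)
X2 = rng.integers(0, 2, n).astype(float)
Z1 = X1 * rng.uniform(30, 50, n)
Z2 = X2 * rng.uniform(30, 50, n)

# Outcome depends on whether a treatment happened, not on its intensity
Y = 2.0 * X1 - 1.5 * X2 + rng.normal(0, 1, n)

def ols_se(design, y):
    """OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
full = np.column_stack([ones, X1, X2, Z1, Z2])   # Y ~ X1 + X2 + Z1 + Z2
reduced = np.column_stack([ones, X1, X2])        # Y ~ X1 + X2

b_full, se_full = ols_se(full, Y)
b_red, se_red = ols_se(reduced, Y)

# Because Z1/Z2 are nearly collinear with X1/X2, the standard errors
# of the X coefficients are inflated in the full model
print("SE(X1): full =", se_full[1], " reduced =", se_red[1])
print("SE(X2): full =", se_full[2], " reduced =", se_red[2])
```

With the Xs and Zs this entangled, the full model can easily show non-significant X coefficients while the reduced model shows significant ones, even when the treatment effects are real in both.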

Now, here are my thoughts:

  1. Z1 and Z2 are not apparently accounting for as much variance in the data as I thought they would.
  2. Maybe my estimates of them are too imprecise to be useful and may actually be reducing my power to detect a treatment effect?
  3. Removing them increases the simplicity of my model, making it easier to describe.
  4. Z1 and Z2 are not central to my hypotheses, nor are they particularly germane in the literature available on this question. That is, other, similar studies don't generally include them, and no one will think it is strange if I don't either.

All of that said, do I really have a reasonable justification to remove Z1 and Z2 from this model for value Y when, for value K in a different model of mine, they would be predictive and thus should be left in?

My conclusion so far is that I either have to conclude they are not useful predictors and remove them from all my models, or conclude that they may be useful predictors at least some of the time and leave them in all my models. Otherwise, I feel as though I am somehow having my cake and eating it too. Is this a fair conclusion to reach? Or are there additional ethical and practical considerations here that I am unaware of?

Bajcz
  • Added some clarification, but this is notation I've seen elsewhere on this site many times. `X1`, `X2`, `Z1`, and `Z2` are your standard independent variables/linear predictors/fixed factors, whereas Y is some outcome variable of interest. – Bajcz Jul 29 '16 at 21:22

2 Answers


My sense is that your variables X1 and X2 are multicollinear with Z1 and Z2, respectively. Some or most of the information contained in X1 and X2 is also contained in Z1 and Z2. Z1 cannot be zero if X1 indicates treatment, so the two are strongly correlated.

From the standpoint of a scientific approach, if your prospective hypothesis was framed in a setting of adjustment for Z1 and Z2, then I would suggest those variables not be removed after test results are in simply to achieve statistical significance. However, including those variables may have been ill-advised at the outset, given what appears to be the intrinsic correlation between the independent variables.

This is not so much a statistics question as a research design question. The statistics are merely shedding light on what seems to be a design flaw.

Edit: As for Z1 and Z2 "not accounting for much variance": they will not appear to account for much variance if most of the effect on the dependent variable is already captured by the corresponding binary treatment variable. The effect of this collinearity is to inflate the standard errors of the correlated variables' coefficients. A formal analysis can be done by estimating variance inflation factors (VIFs) for the beta coefficients obtained from the regression model.
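As a sketch of that formal check (again with made-up data in which the intensity is zero exactly when the unit is untreated), the VIF for a predictor is 1/(1 - R²) from regressing it on the remaining predictors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: intensity Z1 is exactly zero whenever X1 = 0
X1 = rng.integers(0, 2, n).astype(float)
Z1 = X1 * rng.uniform(30, 50, n)

def vif(target, others):
    """VIF = 1 / (1 - R^2) from regressing `target` on `others` plus an intercept."""
    design = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

# Well above the common rule-of-thumb cutoffs of 5 or 10,
# flagging severe collinearity between X1 and Z1
print("VIF for Z1 given X1:", vif(Z1, X1))
```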

Todd D
  • Can you expand this answer a bit more? The OP claims "Z1 and Z2 are not apparently accounting for as much variance in the data as I thought they would", and "If I remove Z1 and Z2, suddenly X1 and X2 become significant." Perhaps clarification from the OP about whether or not the full model results are borderline significant for X1 and X2 would help clarify – AdamO Jul 29 '16 at 21:53

The only reason you would lose precision after adjusting for other factors in a randomized study is if

1) The other factors are correlated with randomization assignment. This means you did a bad job of randomizing, and further implies that the adjusted results are correct and the unadjusted results are confounded.

2) The other factors are independent of both the outcome and the randomization assignment. This means that the tiny loss of degrees of freedom from adjusting for all of 2 parameters has pushed you from borderline statistical significance on one side of the 0.05 threshold to the other side of it. If that happens, you should expand your mind and, instead of reporting a p-value or statistical significance (a universally useless measure), report a 95% confidence interval.
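A sketch of that last suggestion (made-up data; a normal-approximation interval rather than the exact t-based one a stats package would report):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 80

# Hypothetical randomized binary treatment with a true effect of 1.0
X = rng.integers(0, 2, n).astype(float)
Y = 1.0 * X + rng.normal(0, 1, n)

design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
resid = Y - design @ beta
sigma2 = resid @ resid / (n - design.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])

# Normal-approximation 95% confidence interval for the treatment effect;
# this conveys both the size and the precision of the estimate,
# rather than a bare yes/no significance verdict
lo, hi = beta[1] - 1.96 * se, beta[1] + 1.96 * se
print(f"effect = {beta[1]:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```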

AdamO