Factor significant within model but non significant after drop?

Question

This may be quite a basic question, but I was fitting a simple linear model and dropping non-significant terms until I reached a minimal model. I then obtained the significance of each remaining explanatory variable by dropping them one at a time. What I am finding is that one of the variables in my minimal model becomes non-significant when I drop it, and I don't quite understand what is going on. If someone could give me a hint, that would be great (code is below).

So my minimal model was:

m1<-lm(log10(para.ml) ~ treat + prop.r + log(od))

Analysis of Variance Table

Response: log10(para.ml)
                 Df Sum Sq Mean Sq F value    Pr(>F)    
treat             3 4.2925 1.43083  30.113 9.181e-09 ***
prop.r            1 1.5419 1.54190  32.451 4.723e-06 ***
log(od)           1 0.5698 0.56981  11.992  0.001796 ** 
Residuals        27 1.2829 0.04751

So here treat is a factor with 4 levels and both prop.r and log(od) are continuous variables. As you can see, all effects look significant and if I drop prop.r or log(od), model m1 is still preferred. Though the same does not happen if I drop treat:

m2<-update(m1,~.-treat)
anova(m2,m1)

Analysis of Variance Table

Model 1: log10(para.ml) ~ prop.r + log(od)
Model 2: log10(para.ml) ~ treat + prop.r + log(od)
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     30 1.6202                              
2     27 1.2829  3   0.33728 2.3662 0.09308 .

Now, if I get the ANOVA table for this last model, I obtain this:

anova(m2)

Analysis of Variance Table

Response: log10(para.ml)
                 Df Sum Sq Mean Sq F value    Pr(>F)    
prop.r            1 4.8607  4.8607  90.003 1.537e-10 ***
log(od)           1 1.2062  1.2062  22.335 5.045e-05 ***
Residuals        30 1.6202  0.0540

So, comparing anova(m2) and anova(m1), it looks like most of the variation that was attributed to treat is now explained by prop.r or by log(od). Is it simply the case that, when treat is in my model, it takes up a lot of the variation that could equally be explained by the other variables?

Any help appreciated!

A related phenomenon is discussed at http://stats.stackexchange.com/questions/14500/. You seem to be performing an *ad hoc* version of "stepwise regression," which has been [extensively discussed](http://stats.stackexchange.com/search?q=%2Bstepwise+%2Bregression) on this site. There is practically a consensus that this method is best avoided (or used with care by experts in particular situations); there are better model-building techniques available. — whuber, Jun 13 '12 at 14:24

score 3 · Answer 1 · answered Jul 13 '12 at 19:20

In the first code block you do not show how you produced the analysis of variance table. It looks likely that you used anova(m1). If that is the case then the table produced is the sequential table, which means that the first line shows the effect of 'treat' by itself (not adjusting for the other terms), then the 'prop.r' row is the effect of 'prop.r' adjusting for 'treat' but not the other variable, etc.

In that case, if the predictor variables are related to one another, it is possible that 'treat' is related to the response by itself but is redundant given the other variables. This is what it looks like is happening. The second time you use anova you get different results because the adjustment is different.
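To make the difference in adjustment concrete, here is a small sketch that only re-does the arithmetic from the tables printed in the question (it does not use the asker's data, which is not available). The variable names are mine; the numbers are copied from the anova(m1) and anova(m2, m1) output above.

```r
# Residual sums of squares and residual df, copied from the question's tables.
rss_full    <- 1.2829; df_full    <- 27  # m1: treat + prop.r + log(od)
rss_reduced <- 1.6202; df_reduced <- 30  # m2: prop.r + log(od)

# Sum of squares for treat ADJUSTED for prop.r and log(od):
# the drop in RSS when treat is added last.
ss_treat_adjusted <- rss_reduced - rss_full

# Sum of squares for treat entered FIRST (first row of anova(m1)),
# i.e. not adjusted for the other two terms.
ss_treat_sequential <- 4.2925

df_treat  <- df_reduced - df_full
ms_resid  <- rss_full / df_full

# Both F statistics share the same denominator; only the numerator differs.
f_adjusted   <- (ss_treat_adjusted   / df_treat) / ms_resid
f_sequential <- (ss_treat_sequential / df_treat) / ms_resid
p_adjusted   <- pf(f_adjusted, df_treat, df_full, lower.tail = FALSE)

print(c(f_sequential = f_sequential, f_adjusted = f_adjusted,
        p_adjusted = p_adjusted))
```

Up to rounding in the printed tables, f_sequential reproduces the F value in the treat row of anova(m1) and f_adjusted reproduces the F value from anova(m2, m1). The two tests use the same residual mean square and differ only in how much of the explained variation is credited to treat.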

score 2 · Answer 2 · answered Jul 13 '12 at 20:01

It could be related to the type of sums of squares that R uses. Calling anova() on a single lm fit reports Type 1 (sequential) sums of squares. For Type 1 sums of squares, the order of the variables in the model matters (see here for a summary).

treat might be correlated with the other variables in the model. If so, by coming first, it may be credited with much of the variance that the other variables could also explain. Once you remove it, the other variables pick up that variance, so treat no longer adds significantly to a model that already contains them. Try rearranging the order in your minimal model to have treat last and see if it is still significant.
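A rough illustration of the reordering suggestion, using R's built-in mtcars data as a stand-in (this is not the asker's data, and the choice of mpg, cyl, wt and hp is purely an assumption for the example). It checks that the row for a term entered last in the sequential table is the same test as comparing the models with and without that term.

```r
# Stand-in data: treat cyl as a factor, playing the role of 'treat'.
d <- transform(mtcars, cyl = factor(cyl))

first   <- lm(mpg ~ cyl + wt + hp, data = d)  # factor entered first
last    <- lm(mpg ~ wt + hp + cyl, data = d)  # same model, factor entered last
reduced <- lm(mpg ~ wt + hp, data = d)        # factor dropped

# Sequential (Type 1) F for the factor depends on its position.
f_first <- anova(first)["cyl", "F value"]
f_last  <- anova(last)["cyl", "F value"]

# Model comparison and drop1() both test the factor adjusted for the rest.
f_compare <- anova(reduced, last)$F[2]
f_drop1   <- drop1(first, test = "F")["cyl", "F value"]

print(c(f_first = f_first, f_last = f_last,
        f_compare = f_compare, f_drop1 = f_drop1))
```

The last three values agree with each other, while the first differs. drop1(model, test = "F") is a convenient way to get the adjusted test for every term at once without refitting the model in different orders.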

I realized that this answer practically duplicates @Greg's. I've added this to his answer to provide a more complete response. — Oliver, Jul 13 '12 at 20:10

score 1 · Answer 3 · answered Jun 13 '12 at 12:01

The variables you choose for your model can be correlated with each other. This may mean that there is no "best" subset. The significance of a variable thus depends on which other variables are in the model. What really matters is how well the variables work together to fit the data. So when included with certain other variables, a particular variable might appear to add significantly to the model, whereas with another set of variables in the model it might not. As an oversimplified example, suppose X1 can be expressed as a linear combination of X2, X3 and X4. Then if X2, X3 and X4 are already in the model, X1 has no significant effect to add. But if at least one of the variables X2, X3 and X4 is not in the model when X1 is being evaluated, X1 could turn out to significantly improve the model fit.
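The oversimplified example can be sketched with simulated data. Everything below is hypothetical: the sample size, the seed, the coefficients, and the small amount of noise added to X1 so that it is nearly, rather than exactly, a linear combination of the others (with an exact combination lm() cannot estimate a separate coefficient for X1 and reports NA for it).

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n  <- 200
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)

# x1 is almost a linear combination of x2, x3 and x4 (assumed noise sd = 0.1).
x1 <- x2 + x3 + x4 + rnorm(n, sd = 0.1)

# The response depends on x2, x3 and x4 only.
y <- x2 + x3 + x4 + rnorm(n)

alone  <- lm(y ~ x1)                 # x1 evaluated by itself
others <- lm(y ~ x2 + x3 + x4)       # x2, x3, x4 already in the model
full   <- lm(y ~ x2 + x3 + x4 + x1)  # x1 added to them

# p-value for x1 on its own, and for x1 given the other three.
p_alone <- anova(alone)["x1", "Pr(>F)"]
p_given <- anova(others, full)[2, "Pr(>F)"]

print(c(p_alone = p_alone, p_given = p_given))
```

By itself x1 stands in for the sum of the other three and looks strongly related to y. Once x2, x3 and x4 are in the model there is almost nothing left for x1 to explain, so its adjusted test is typically far from significant.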

Michael R. Chernick

Hi Michael, thank you very much for your reply. That does make sense to me. But in my case what I am getting is that my X1 (to use your notation) seems significant in the presence of my X2 and X3, but isn't when I drop it from the model. I guess what I don't understand well is why in the first model some of the variation is being explained by my X1 (treat) and not by the other variables, since that is what happens when I simplify the model. – ramiro Jun 13 '12 at 13:34
@ricardo That sounds simple to explain. Suppose Z is a linear function of X1, X2, and X3 and Z is a useful predictor of the model response. Then in combination X1, X2, and X3 are significant in explaining y but none of them by themselves would be. – Michael R. Chernick Jun 13 '12 at 14:41
