There are two good choices for you. The first, and it is the lesser of the two, is stepwise regression. It starts from some model and gradually alters it, adding or dropping variables in search of the best model as measured by some criterion such as the AIC or BIC. The second is better, for theoretical reasons among others: an exhaustive, combinatorial search over all possible models.
You could construct a sequence of do loops or for-next loops that includes or excludes every possible combination of variables. This does the same job as stepwise, except that you cover every combination rather than one greedy path through them. The added value is that if there is a theoretical reason for some variable to be in the model, you can force it in: instead of letting a loop include or exclude it, it is simply always there. A sketch of this is below.
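Here is a minimal sketch of that exhaustive search in Python, assuming the predictors sit in a pandas DataFrame `X`, the response is `y`, and statsmodels does the fitting; the function name, the `forced` argument, and the default choice of BIC are just for illustration.

```python
from itertools import combinations

import pandas as pd
import statsmodels.api as sm


def best_subset(X: pd.DataFrame, y, forced=(), criterion="bic"):
    """Fit every subset of the columns of X (always keeping the `forced`
    columns) and return the fit with the smallest AIC or BIC."""
    optional = [c for c in X.columns if c not in forced]
    best = None
    for r in range(len(optional) + 1):
        for combo in combinations(optional, r):
            cols = list(forced) + list(combo)
            # intercept column plus whichever regressors are in this subset
            design = pd.DataFrame({"const": 1.0}, index=X.index)
            if cols:
                design = pd.concat([design, X[cols]], axis=1)
            fit = sm.OLS(y, design).fit()
            score = fit.bic if criterion == "bic" else fit.aic
            if best is None or score < best[0]:
                best = (score, cols, fit)
    return best  # (criterion value, chosen columns, fitted model)
```

Calling something like `best_subset(X, y, forced=["age"])` keeps `age` in every candidate model while the loops toggle the rest.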
You should then calculate either the AIC or the BIC. The model with the smallest value of the information criterion you chose would win. They differ in how they penalize model complexity. With small sample sizes, the AIC will tend to overfit.
What matters with the AIC or the BIC is only the relative ranking of the set of models fit to the same data. If you had different data sets and different problems, an AIC of -23 for one problem couldn't be compared to an AIC of -22 for another. With the same data set and the same problem, however, -23 indicates a better model than -22.
The difference in calculation between the two methods is how they penalize added model structure. The AIC adds a penalty of $2k$ to its value, where $k$ is the number of independent variables: a model with two independent variables carries a penalty of 4, while one with three carries a penalty of 6. The BIC adds a penalty of $\log(n)\,k$, where $n$ is the number of observations in the sample, so once $n$ is at least 8, $\log(n) > 2$ and the BIC penalizes each added variable more heavily than the AIC does. As a rule of thumb, though, their rankings will be highly concordant.
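To make those penalties concrete, here is the arithmetic in Python, assuming the usual definitions $AIC = 2k - 2\ln\hat{L}$ and $BIC = k\log(n) - 2\ln\hat{L}$, with $k$ taken as whatever count of terms you are penalizing; the numbers are made up.

```python
import math

def aic(loglik: float, k: int) -> float:
    # AIC = 2k - 2 * log-likelihood
    return 2 * k - 2 * loglik

def bic(loglik: float, k: int, n: int) -> float:
    # BIC = log(n) * k - 2 * log-likelihood
    return math.log(n) * k - 2 * loglik

# Same log-likelihood, one extra variable: the AIC penalty grows by 2,
# while the BIC penalty grows by log(n), which exceeds 2 once n >= 8.
print(aic(-50.0, 2), aic(-50.0, 3))            # 104.0, 106.0
print(bic(-50.0, 2, 100), bic(-50.0, 3, 100))  # ~109.2, ~113.8
```

In practice the fitting library usually reports both values directly (as with the `fit.aic` and `fit.bic` used earlier), so helpers like these only serve to make the penalty explicit.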
A way to think about the trade-off between the AIC and the BIC is in terms of what they are trying to accomplish. The BIC grants equal prior weight to each model, while the AIC grants lower prior weight to models with more parameters. The AIC does this by penalizing structure directly, with a fixed penalty per parameter; the BIC instead trades off sample size against structure.
If you have a small sample, adding a variable may improve the predictive power of the model, because you are giving it more of the natural variation to work with. On the other hand, once the sample size becomes large, two variables that covary may be providing mostly the same information. Adding the second can then make the model worse: whatever information it carries is offset by collinearity, to the point that it amounts to added noise. The small simulation below illustrates the large-sample case.
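A rough illustration of that large-sample case, using simulated data and the statsmodels BIC; the seed, sample size, and noise scales are arbitrary assumptions, and the outcome is typical rather than guaranteed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2.0 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# With x2 carrying almost no new information, its coefficient is mostly
# noise and the log(n) penalty is not repaid: on most seeds the
# two-variable model has the higher (worse) BIC.
print(small.bic, large.bic)
```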
The reason to use stepwise is that it is almost certainly a built-in function of your software; the reason to go combinatorial is that it is better. In either case, do not use p-values as a selection criterion, even though they will be correlated with the AIC and the BIC.