
Sorry if this is a total newbie question, but can I run the following regression formula: Y = X1 + X2 + X3 + X1*X2 + X1*X3, without adding higher-order interaction terms such as X1*X2*X3?

Thanks

whuber
    This is *multiple* regression, not "multivariate." The latter is the case of two or more *response* variables $Y_1,Y_2,\ldots$ analyzed simultaneously. BTW, a useful thought experiment is to consider a dataset with a fairly large number $k$ of predictors $X_i.$ If you were to insist on including *all* interactions, right through the $k$-way interaction, you would have to estimate $2^k-1$ coefficients, therefore requiring *at least* $2^k$ observations. For instance, with just a modest $k=50,$ how many studies do you suppose could afford to make $2^{50}\gt10^{15}$ = one quadrillion observations? – whuber Feb 21 '22 at 15:55
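The count in the comment above can be checked directly. This is only a sketch: the function name `n_interaction_coefficients` is made up for illustration, and it counts main effects plus every interaction up to the $k$-way term, excluding the intercept.

```python
from math import comb


def n_interaction_coefficients(k: int) -> int:
    """Count main effects plus all 2-way through k-way interactions
    among k predictors (intercept not included)."""
    # There are C(k, order) distinct terms involving exactly `order` predictors.
    return sum(comb(k, order) for order in range(1, k + 1))


for k in (3, 5, 10, 50):
    # The sum of binomial coefficients equals 2**k - 1.
    print(k, n_interaction_coefficients(k), 2**k - 1)
```

For the three predictors in the question, the full model would have 7 coefficients besides the intercept, against 5 in the model actually proposed.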

1 Answer


Yes, that's perfectly acceptable. It's usually best to base your model on your understanding of the subject matter. If that understanding indicates that only the X1:X2 and X1:X3 interactions are likely to be important, then including just those two would generally be OK. Certainly there is no rule in statistics against doing that. There can be a problem when you include interactions and omit individual terms for the predictors, however, as discussed here.
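As a rough illustration of what that model looks like in practice, here is a minimal sketch using NumPy's least-squares solver on simulated data. The data, the seed, the "true" coefficients in `beta_true`, and the helper `design_matrix` are all assumptions made up for the example. They are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 200
x1, x2, x3 = rng.normal(size=(3, n))  # three simulated predictors

# Hypothetical coefficients, in order: intercept, X1, X2, X3, X1:X2, X1:X3
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 1.5, -0.75])


def design_matrix(x1, x2, x3):
    """Columns for Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 (no three-way term).
    All three main effects are kept alongside the two interactions."""
    return np.column_stack([np.ones_like(x1), x1, x2, x3, x1 * x2, x1 * x3])


X = design_matrix(x1, x2, x3)
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # simulated response

# Ordinary least squares fit of the six coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

Nothing in the fit requires an X1:X2:X3 column. Leaving it out simply amounts to assuming that its coefficient is zero.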

As you are just starting to learn about this, recognize that there is a tradeoff that involves the art of statistical modeling.

It can be best to start with as complex a model as possible that won't overfit your data. See Section 4.1 of Frank Harrell's course notes or book. That could involve several levels of interactions, flexible modeling of continuous predictors, etc. If you have a very large data set, that can be a more productive approach, particularly if your interest is in prediction.

With a more complex model and a limited data set, however, you run the risk of overfitting and finding spurious "significant" effects. Also, because a complex model requires you to estimate more coefficients from the same data, you can lose power to find truly significant effects. With that in mind, only you and your colleagues can weigh the benefits against the risks of a more complex model in any particular circumstance. Harrell's course notes and book provide useful guidance on this.
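One way to see the overfitting risk is that the in-sample fit of a least-squares model can never get worse when terms are added, even if those terms are unrelated to the outcome. The sketch below uses simulated data in which only X1 matters. The sample size, seed, and helper `r_squared` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 30                          # deliberately small sample
x1, x2, x3 = rng.normal(size=(3, n))

# Hypothetical truth: only X1 affects Y; X2, X3, and all interactions are noise.
y = 1.0 + 2.0 * x1 + rng.normal(size=n)


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


ones = np.ones(n)
X_small = np.column_stack([ones, x1])
# Full model: all main effects, all two-way terms, and the three-way term.
X_big = np.column_stack(
    [ones, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3]
)

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)
print(f"R^2 with X1 only:        {r2_small:.3f}")
print(f"R^2 with all 7 terms:    {r2_big:.3f}")
```

The larger model's R^2 is at least as high by construction, so apparent in-sample improvement on its own is not evidence that the extra terms are real. Validation on held-out data or resampling is needed to judge that.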

EdM