0

Suppose you have variables $X_{1}, \dots, X_{100}$ (discrete or continuous) in a linear regression. If you want to test all possible interactions, should you include them all in one model?

Peter Ellis
Thomy Ja

2 Answers

2

No. Counting interactions of every order, that would be about $10^{30}$ interaction terms, which is far more parameters than the data you have available could estimate.
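
As a rough check on that figure, the number of interaction terms among $p$ variables is the number of subsets of size two or more, $\sum_{k=2}^{p} \binom{p}{k} = 2^{p} - p - 1$. The sketch below computes it for $p = 100$, both for all orders and for two-way interactions only. The helper name `count_interactions` is made up for illustration, and the count treats every variable as contributing a single column (a discrete variable with several levels would contribute more).

```python
from math import comb


def count_interactions(p, max_order=None):
    """Count interaction terms among p variables for orders 2..max_order.

    Hypothetical helper: assumes each variable enters as one column.
    """
    if max_order is None:
        max_order = p
    return sum(comb(p, k) for k in range(2, max_order + 1))


pairwise = count_interactions(100, max_order=2)
all_orders = count_interactions(100)

print(pairwise)             # 4950
print(f"{all_orders:.3e}")  # 1.268e+30
```

Even the two-way count alone adds 4950 columns to the 100 main effects, so the choice of interpretation matters a great deal for whether a single model is feasible.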

Peter Ellis
  • So how would you choose which variables to test for interaction? – Thomy Ja Mar 07 '12 at 03:46
  • *I* would hope to have some theoretical basis for choosing a subset of variables to look at. If you want the data to dictate your choice you are in the world of data mining (I've added the data-mining tag) and should look up some of the methods there. – Peter Ellis Mar 07 '12 at 03:49
  • Where do you get $10^{30}$? Aren't there only $\binom{100}{2} = 4950$ possible pairwise interactions? – Macro Mar 07 '12 at 03:52
  • The question was all possible interactions: x – Peter Ellis Mar 07 '12 at 04:01
  • First, resolve whether you mean *all* possible interactions (which is how @PeterEllis interpreted your question) or just all possible *two-way* interactions (which is how @Macro interpreted it). But I'd be cautious about a model with 100 variables, much less interactions among them. It's hard to conceive how such a model could be interpreted. What's the context? – Peter Flom Mar 07 '12 at 12:16
  • You're right @PeterEllis. I guess I assumed he couldn't possibly be talking about _all possible_ interactions, since those would be a nightmare to interpret. – Macro Mar 07 '12 at 16:01
  • @PeterEllis - I didn't understand your notation, but using XL I got 1.369*10^30: does that match your result? Checking whether my method works. – rolando2 Mar 08 '12 at 01:56
  • I found interesting the related discussion at http://stats.stackexchange.com/questions/16480/how-to-quickly-select-important-variables-from-a-very-large-dataset – rolando2 Mar 08 '12 at 02:03
  • @rolando2, sorry about my notation, it should work if it's pasted as a single line into R. I don't have the knack yet of the markup language for "comments". I get 1.268*10^30. – Peter Ellis Mar 08 '12 at 13:57
  • @Peter I have created regression models with 200+ variables using (many) two-way interactions. The objective was *prediction* rather than interpretability, but interpretation was no problem: it's the same whether there are three variables or millions of them. The challenge is to find which of the tens of thousands of interactions are useful. – whuber Oct 31 '19 at 19:08
  • @whuber I'm sure there are cases where this is sensible - but I just said be cautious. Yes, the interpretation is the same, mathematically, we're just controlling for more variables. But, substantively, I think there are few situations where controlling for 100 variables is going to be easy to interpret. – Peter Flom Nov 01 '19 at 11:34
2

Quoting the wise Andrew Gelman and Jennifer Hill (see p. 36) on hunting interactions:

In practice, inputs that have large main effects also tend to have large interactions with other inputs (however, small main effects do not preclude the possibility of large interactions). For example, smoking has a huge effect on cancer. In epidemiological studies of other carcinogens, it is crucial to adjust for smoking both as a main effect and as an interaction[...]: high levels of radon are associated with greater likelihood of cancer but this difference is much greater for smokers than for nonsmokers.
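
One way to act on this heuristic, offered only as a sketch and not as Gelman and Hill's procedure, is to rank inputs by the size of their estimated main effects and consider two-way interactions among the top few. The function name `candidate_interactions`, the cutoff `k`, and the effect sizes below are all hypothetical, and the estimates are assumed to be on a comparable (for example standardized) scale.

```python
from itertools import combinations


def candidate_interactions(main_effects, k):
    """Return all pairs among the k inputs with the largest |main effect|.

    main_effects: dict mapping input name -> estimated main effect,
    assumed to be on a comparable (e.g. standardized) scale.
    """
    ranked = sorted(main_effects, key=lambda name: abs(main_effects[name]),
                    reverse=True)
    return list(combinations(ranked[:k], 2))


# Made-up effect sizes, for illustration only.
effects = {"smoking": 2.1, "radon": 0.6, "age": 0.9, "diet": -0.1}
pairs = candidate_interactions(effects, k=3)
print(pairs)  # [('smoking', 'age'), ('smoking', 'radon'), ('age', 'radon')]
```

Because small main effects do not rule out large interactions, a screen like this can only prioritize candidates; it cannot show that the excluded interactions are negligible.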

dimitriy