
Consider multiple linear regression. This question might be deceptively simple, but I'm trying to understand intuitively why, if I have predictors X1 and X2, interactions between these predictors can be adequately captured by X1 * X2.

I know interaction terms are modelled as products, just because that's what I was taught in school, and that's what everyone says to do. I'm guessing there is maybe some geometric argument.

But why is a product (of, say, two numeric features, setting aside the extra complexity of one being a dummy variable while the other is numeric, etc.) going to adequately capture interactions?

Why aren't "interactions" best captured by some general f(X1, X2) by default, instead of specifically X1 * X2?

I can see the idea that X1 * X2 may capture situations where the signs of X1 and X2 agree or differ, but then why wouldn't interactions be modelled by default as, say, f(X1, X2) = sign(X1) * sign(X2) instead of f(X1, X2) = X1X2?

I realise I can add any other f(X1, X2) to a regression or any predictive model, but finding the exact shape of an interaction by hand is time-consuming. How do I know X1X2 is a good first guess?

1 Answer


We can conceive of an "interaction" between regressor variables $x_1$ and $x_2$ as a departure from a perfectly linear relationship in which the relationship between one regressor and the response is different for different values of the other regressors. The usual "interaction term" is, in a sense to be explained below, a "simplest" such departure.

Definitions and Concepts

"Linear relationship" simply means the usual model in which we suppose a response $Y$ differs from a linear combination of the $x_i$ (and a constant) by independent, zero-mean errors $\varepsilon:$

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.\tag{*}$$

"Interaction," in the most general sense, means the parameters $\beta_i$ may depend on other variables.

Specifically, in this example of just two regressors, we might generically write

$$\beta_1 = \beta_1(x_2)\text{ and }\beta_2 = \beta_2(x_1).$$

Analysis

Now, in practice, nobody except a theoretical physicist really believes model $(*)$ is fully accurate: it's an approximation to the truth and, we hope, a close one. Pursuing this idea further, we might ask whether we could similarly approximate the functions $\beta_i$ with linear ones in case we need to model some kind of interaction. Specifically, we could try to write

$$\beta_1(x_2) = \gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1;$$ $$\beta_2(x_1) = \delta_0 + \delta_1 x_1 + \text{ tiny error}_2.$$

Let's see where that leads. Plugging these linear approximations into $(*)$ gives

$$\eqalign{ Y &= \beta_0 + \beta_1(x_2) x_1 + \beta_2(x_1) x_2 + \varepsilon \\ &= \beta_0 + (\gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1)x_1 + (\delta_0 + \delta_1 x_1 + \text{ tiny error}_2)x_2 + \varepsilon \\ &= \beta_0 + \gamma_0 x_1 + \delta_0 x_2 + (\gamma_1 + \delta_1)x_1 x_2 + \ldots }$$

where "$\ldots$" represents the total error,

$$\ldots = (\text{ tiny error}_1)x_1 + (\text{ tiny error}_2)x_2 + \varepsilon.$$

With any luck, the terms obtained by multiplying those two "tiny errors" by typical values of the $x_i$ will either (a) be inconsequential compared to $\varepsilon$ or (b) behave like random terms which, when added to $\varepsilon$ (perhaps after adjusting the constant term $\beta_0$ to accommodate any systematic bias), can be treated as a single random error term.

In either case, with a change of notation we see that this linear-approximation-to-an-interaction model takes the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1 x_2 + \varepsilon,\tag{**}$$

which is precisely the usual "interaction" regression model. (Note that none of the new parameters, nor $\varepsilon$ itself, is the same quantity originally represented by those terms in $(*).$)

Observe how $\beta_{12}$ arises through variation in both the original parameters. It captures the combination of (i) how the coefficient of $x_1$ depends on $x_2$ (namely, through $\gamma_1$) and (ii) how the coefficient of $x_2$ depends on $x_1$ (through $\delta_1$).
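To see this concretely, here is a minimal simulation sketch (with made-up coefficient values) in which the coefficients genuinely vary linearly, as above, and we then fit the product-interaction model $(**)$ by ordinary least squares. The fitted coefficient of $x_1 x_2$ recovers $\gamma_1 + \delta_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical "true" values for the linearly varying coefficients:
# beta_1(x2) = gamma0 + gamma1 * x2,  beta_2(x1) = delta0 + delta1 * x1.
beta0, gamma0, gamma1, delta0, delta1 = 1.0, 2.0, 0.5, -1.0, 0.3

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.1, size=n)

# Generate the response with genuinely varying coefficients.
y = beta0 + (gamma0 + gamma1 * x2) * x1 + (delta0 + delta1 * x1) * x2 + eps

# Fit the usual interaction model (**) by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# coef is approximately [1.0, 2.0, -1.0, 0.8]: the product term's
# coefficient is gamma1 + delta1 = 0.5 + 0.3.
print(coef)
```

Note that the separate contributions $\gamma_1$ and $\delta_1$ are not identifiable from the fit; only their sum, the coefficient of $x_1 x_2$, is.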


Some Consequences

It is a consequence of this analysis that if we fix all but one of the regressors, then (conditionally) the response $Y$ is still a linear function of the remaining regressor. For instance, if we fix the value of $x_2,$ then we may rewrite the interaction model $(**)$ as

$$Y = (\beta_0 + \beta_2 x_2) + (\beta_1 + \beta_{12} x_2) x_1 + \varepsilon,$$

where the intercept is $\beta_0 + \beta_2 x_2$ and the slope (that is, the $x_1$ coefficient) is $\beta_1 + \beta_{12} x_2.$ This allows for easy description and insight. Geometrically, the surface given by the function

$$f(x_1,x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1x_2$$

is ruled: when we slice it parallel to either of the coordinate axes, the result is always a line. (However, the surface itself is not planar except when $\beta_{12}=0.$ Indeed, it everywhere has negative Gaussian curvature.)
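A quick numerical sketch (using arbitrary illustrative coefficients) confirms the ruled-surface property: slicing at a fixed $x_2$ yields an exact line in $x_1$ with intercept $\beta_0 + \beta_2 x_2$ and slope $\beta_1 + \beta_{12} x_2$:

```python
import numpy as np

# Arbitrary illustrative coefficients for the interaction surface.
b0, b1, b2, b12 = 1.0, 2.0, -1.0, 0.8

def f(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# Slice the surface at a fixed x2: the result is a straight line in x1.
x2_fixed = 3.0
x1 = np.linspace(-5, 5, 11)
slice_vals = f(x1, x2_fixed)

slope = np.diff(slice_vals) / np.diff(x1)  # constant along the slice
print(np.allclose(slope, b1 + b12 * x2_fixed))  # True
print(np.allclose(slice_vals,
                  (b0 + b2 * x2_fixed) + (b1 + b12 * x2_fixed) * x1))  # True
```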

If our hope for (a) or (b) does not pan out, we might further expand the functional behavior of the original $\beta_i$ to include terms of second order or higher. Carrying out the same analysis shows this will introduce terms of the form $x_1^2,$ $x_2^2,$ $x_1x_2^2,$ $x_1^2x_2,$ and so forth into the model. In this sense, including a (product) interaction term is merely the first, and simplest, step towards modeling nonlinear relationships between the response and the regressors by means of polynomial functions.
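In code, that second-order expansion just means adding more columns to the design matrix. Here is a sketch (column choice is illustrative; which terms to include depends on how the $\beta_i$ are assumed to vary):

```python
import numpy as np

def second_order_design(x1, x2):
    """Design matrix for a second-order expansion of the varying
    coefficients beta_1(x2) and beta_2(x1)."""
    return np.column_stack([
        np.ones_like(x1),
        x1, x2,
        x1 * x2,        # first-order interaction, as in (**)
        x1 * x2**2,     # from a quadratic x2^2 term in beta_1(x2)
        x1**2 * x2,     # from a quadratic x1^2 term in beta_2(x1)
    ])

x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([1.0, -1.0, 0.5])
print(second_order_design(x1, x2).shape)  # (3, 6)
```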

Finally, in his textbook Exploratory Data Analysis (Addison-Wesley, 1977), John Tukey showed how this approach can be carried out far more generally. After first "re-expressing" (that is, applying suitable non-linear transformations to) the regressors and the response, it is often the case that either model $(*)$ applies to the transformed variables or, if not, model $(**)$ can easily be fit (using a robust analysis of residuals). This allows a huge variety of nonlinear relationships to be expressed and interpreted as conditionally linear responses.

whuber
    This is a lovely detailed answer. Thank you. Aside, I keep seeing references to Tukey's book coming up on this site... even though it's so old. Perhaps it's time to take a read of it. – ChilliProject Mar 06 '19 at 08:37