We can conceive of an "interaction" between regressor variables $x_1$ and $x_2$ as a departure from a perfectly linear relationship: one in which the association between one regressor and the response differs for different values of the other regressor. The usual "interaction term" is, in a sense to be explained below, a "simplest" such departure.
Definitions and Concepts
"Linear relationship" simply means the usual model in which we suppose a response $Y$ differs from a linear combination of the $x_i$ (and a constant) by independent, zero-mean errors $\varepsilon:$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.\tag{*}$$
"Interaction," in the most general sense, means the parameters $\beta_i$ may depend on other variables.
Specifically, in this example of just two regressors, we might generically write
$$\beta_1 = \beta_1(x_2)\text{ and }\beta_2 = \beta_2(x_1).$$
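To make this concrete, here is a minimal simulation sketch (Python with numpy; the coefficient functions and all numerical values are hypothetical, chosen only for illustration) in which the coefficient of $x_1$ genuinely depends on $x_2$ and vice versa:

```python
import numpy as np

# A minimal sketch: the coefficient of x1 varies with x2, and vice versa.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0.0, 2.0, n)
x2 = rng.uniform(0.0, 2.0, n)

beta0 = 1.0
beta1 = lambda x2: 0.5 + 0.8 * x2    # hypothetical beta_1(x2)
beta2 = lambda x1: -0.3 + 0.4 * x1   # hypothetical beta_2(x1)

eps = rng.normal(0.0, 0.1, n)
# With linear coefficient functions (as here), the product model derived
# below holds exactly; more general functions make it an approximation.
y = beta0 + beta1(x2) * x1 + beta2(x1) * x2 + eps
```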
Analysis
Now, in practice, nobody except a theoretical physicist really believes model $(*)$ is fully accurate: it's an approximation to the truth and, we hope, a close one. Pursuing this idea further, we might ask whether we could similarly approximate the functions $\beta_i$ with linear ones in case we need to model some kind of interaction. Specifically, we could try to write
$$\beta_1(x_2) = \gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1;$$
$$\beta_2(x_1) = \delta_0 + \delta_1 x_1 + \text{ tiny error}_2.$$
Let's see where that leads. Plugging these linear approximations into $(*)$ gives
$$\eqalign{
Y &= \beta_0 + \beta_1(x_2) x_1 + \beta_2(x_1) x_2 + \varepsilon \\
&= \beta_0 + (\gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1)x_1 + (\delta_0 + \delta_1 x_1 + \text{ tiny error}_2)x_2 + \varepsilon \\
&= \beta_0 + \gamma_0 x_1 + \delta_0 x_2 + (\gamma_1 + \delta_1)x_1 x_2 + \ldots
}$$
where "$\ldots$" represents the total error,
$$\ldots = (\text{ tiny error}_1)x_1 + (\text{ tiny error}_2)x_2 + \varepsilon.$$
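A quick symbolic check of this expansion (a sketch using sympy; the symbol names mirror the notation above, with the error terms omitted) confirms that the only new term is the product $x_1 x_2,$ carrying the coefficient $\gamma_1 + \delta_1$:

```python
import sympy as sp

# Symbols mirroring the notation above (tiny errors and epsilon omitted).
X1, X2 = sp.symbols("x_1 x_2")
b0, g0, g1, d0, d1 = sp.symbols("beta_0 gamma_0 gamma_1 delta_0 delta_1")

expr = sp.expand(b0 + (g0 + g1 * X2) * X1 + (d0 + d1 * X1) * X2)
print(expr)                 # the full expansion; the x_1*x_2 terms carry gamma_1 and delta_1
print(expr.coeff(X1 * X2))  # coefficient of the product term: delta_1 + gamma_1
```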
With any luck, multiplying those two "tiny errors" by typical values of the $x_i$ will either (a) produce something inconsequential compared to $\varepsilon$ or (b) produce terms that behave like random noise and, when added to $\varepsilon$ (perhaps after adjusting the constant term $\beta_0$ to absorb any systematic bias), can be treated as a single random error term.
In either case, with a change of notation we see that this linear-approximation-to-an-interaction model takes the form
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1 x_2 + \varepsilon,\tag{**}$$
which is precisely the usual "interaction" regression model. (Note that none of the new parameters, nor $\varepsilon$ itself, is the same quantity originally represented by those terms in $(*).$)
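Fitting $(**)$ is then routine. As an illustrative sketch (assuming the simulated `x1`, `x2`, `y` from the earlier snippet, plus pandas and statsmodels), the usual formula interface creates the product column automatically:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})  # data simulated above

# "x1 * x2" expands to x1 + x2 + x1:x2, i.e. exactly the interaction model (**).
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)  # estimates of beta_0, beta_1, beta_2, and beta_12 (= gamma_1 + delta_1)
```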
Observe how $\beta_{12}$ arises through variation in both the original parameters. It captures the combination of (i) how the coefficient of $x_1$ depends on $x_2$ (namely, through $\gamma_1$) and (ii) how the coefficient of $x_2$ depends on $x_1$ (through $\delta_1$).
Some Consequences
It is a consequence of this analysis that if we fix all but one of the regressors, then (conditionally) the response $Y$ is still a linear function of the remaining regressor. For instance, if we fix the value of $x_2,$ then we may rewrite the interaction model $(**)$ as
$$Y = (\beta_0 + \beta_2 x_2) + (\beta_1 + \beta_{12} x_2) x_1 + \varepsilon,$$
where the intercept is $\beta_0 + \beta_2 x_2$ and the slope (that is, the $x_1$ coefficient) is $\beta_1 + \beta_{12} x_2.$ This allows for easy description and insight. Geometrically, the surface given by the function
$$f(x_1,x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1x_2$$
is ruled: when we slice it with a plane on which either $x_1$ or $x_2$ is held constant, the result is always a line. (However, the surface itself is not planar except when $\beta_{12}=0;$ indeed, when $\beta_{12}\ne 0$ it everywhere has negative Gaussian curvature.)
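To read off this conditional linearity from a fitted model (a sketch, continuing with the hypothetical fit from the earlier snippet), one can evaluate the implied intercept and slope at a few fixed values of $x_2$:

```python
# Conditional intercept and slope in x1 at fixed values of x2,
# using the hypothetical fit from the earlier snippet.
b0_, b1_, b2_, b12_ = (fit.params["Intercept"], fit.params["x1"],
                       fit.params["x2"], fit.params["x1:x2"])

for x2_fixed in (0.0, 1.0, 2.0):
    intercept = b0_ + b2_ * x2_fixed   # beta_0 + beta_2 * x_2
    slope = b1_ + b12_ * x2_fixed      # beta_1 + beta_12 * x_2
    print(f"x2 = {x2_fixed:.1f}:  E[Y] = {intercept:.2f} + {slope:.2f} * x1")
```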
If our hope for (a) or (b) does not pan out, we might further expand the functional behavior of the original $\beta_i$ to include terms of second order or higher. Carrying out the same analysis shows this will introduce terms of the form $x_1^2,$ $x_2^2,$ $x_1x_2^2,$ $x_1^2x_2,$ and so forth into the model (as sketched below). In this sense, including a (product) interaction term is merely the first--and simplest--step towards modeling nonlinear relationships between the response and the regressors by means of polynomial functions.
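As a sketch of that next step (again with the hypothetical data from above, and patsy's `I()` notation for arithmetic inside a formula), a second-order expansion of the coefficient functions amounts to adding exactly those polynomial columns:

```python
# Quadratic coefficient functions introduce squared and mixed higher-order terms.
poly_fit = smf.ols(
    "y ~ x1 * x2 + I(x1**2) + I(x2**2) + I(x1**2 * x2) + I(x1 * x2**2)",
    data=df,
).fit()
print(poly_fit.params)
```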
Finally, in his textbook EDA (Addison-Wesley 1977), John Tukey showed how this approach can be carried out far more generally. After first "re-expressing" (that is, applying suitable non-linear transformations to) the regressors and the response, it often is the case that either model $(*)$ applies to the transformed variables or, if not, model $(**)$ can easily be fit (using a robust analysis of residuals). This allows for a huge variety of nonlinear relationships to be expressed and interpreted as conditionally linear responses.
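As a small illustration of re-expression (a sketch with hypothetical multiplicative data, not an example taken from EDA itself): taking logarithms of a power-law relationship $Y = \alpha\, x_1^{\theta_1} x_2^{\theta_2}\eta$ turns it into model $(*)$ on the transformed scale:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
u1 = rng.uniform(0.5, 3.0, n)
u2 = rng.uniform(0.5, 3.0, n)
eta = np.exp(rng.normal(0.0, 0.05, n))   # multiplicative error
yy = 2.0 * u1**1.5 * u2**-0.7 * eta      # hypothetical power-law "truth"

d = pd.DataFrame({"log_y": np.log(yy), "log_x1": np.log(u1), "log_x2": np.log(u2)})
# On the log scale the relationship is linear: model (*) applies to the re-expressed variables.
print(smf.ols("log_y ~ log_x1 + log_x2", data=d).fit().params)  # roughly (log 2, 1.5, -0.7)
```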