We can conceive of an "interaction" between regressor variables $x_1$ and $x_2$ as a departure from a perfectly linear relationship: one in which the association between one regressor and the response differs for different values of the other regressor. The usual "interaction term" is, in a sense to be explained below, a "simplest" such departure.
Definitions and Concepts
"Linear relationship" simply means the usual model in which we suppose a response $Y$ differs from a linear combination of the $x_i$ (and a constant) by independent, zero-mean errors $\varepsilon:$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.\tag{*}$$
"Interaction," in the most general sense, means the parameters $\beta_i$ may depend on other variables.
Specifically, in this example of just two regressors, we might generically write
$$\beta_1 = \beta_1(x_2)\text{ and }\beta_2 = \beta_2(x_1).$$
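To make this concrete, here is a minimal simulation sketch (Python with numpy; the coefficient functions and all numerical values are hypothetical, chosen only for illustration) in which the coefficient of $x_1$ genuinely depends on $x_2$ and vice versa:

```python
import numpy as np

# A minimal sketch: the coefficient of x1 varies with x2, and vice versa.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0.0, 2.0, n)
x2 = rng.uniform(0.0, 2.0, n)

beta0 = 1.0
beta1 = lambda x2: 0.5 + 0.8 * x2    # hypothetical beta_1(x2)
beta2 = lambda x1: -0.3 + 0.4 * x1   # hypothetical beta_2(x1)

eps = rng.normal(0.0, 0.1, n)
# With linear coefficient functions (as here), the product model derived
# below holds exactly; more general functions make it an approximation.
y = beta0 + beta1(x2) * x1 + beta2(x1) * x2 + eps
```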
Analysis
Now, in practice, nobody except a theoretical physicist really believes model $(*)$ is fully accurate: it's an approximation to the truth and, we hope, a close one. Pursuing this idea further, we might ask whether we could similarly approximate the functions $\beta_i$ with linear ones in case we need to model some kind of interaction. Specifically, we could try to write
$$\beta_1(x_2) = \gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1;$$
$$\beta_2(x_1) = \delta_0 + \delta_1 x_1 + \text{ tiny error}_2.$$
Let's see where that leads. Plugging these linear approximations into $(*)$ gives
$$\eqalign{
Y &= \beta_0 + \beta_1(x_2) x_1 + \beta_2(x_1) x_2 + \varepsilon \\
&= \beta_0 + (\gamma_0 + \gamma_1 x_2 + \text{ tiny error}_1)x_1 + (\delta_0 + \delta_1 x_1 + \text{ tiny error}_2)x_2 + \varepsilon \\
&= \beta_0 + \gamma_0 x_1 + \delta_0 x_2 + (\gamma_1 + \delta_1)x_1 x_2 + \ldots
}$$
where "$\ldots$" represents the total error,
$$\ldots = (\text{ tiny error}_1)x_1 + (\text{ tiny error}_2)x_2 + \varepsilon.$$
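A quick symbolic check of this expansion (a sketch using sympy; the symbol names mirror the notation above, with the error terms omitted) confirms that the only new term is the product $x_1 x_2,$ carrying the coefficient $\gamma_1 + \delta_1$:

```python
import sympy as sp

# Symbols mirroring the notation above (tiny errors and epsilon omitted).
X1, X2 = sp.symbols("x_1 x_2")
b0, g0, g1, d0, d1 = sp.symbols("beta_0 gamma_0 gamma_1 delta_0 delta_1")

expr = sp.expand(b0 + (g0 + g1 * X2) * X1 + (d0 + d1 * X1) * X2)
print(expr)                 # the full expansion; the x_1*x_2 terms carry gamma_1 and delta_1
print(expr.coeff(X1 * X2))  # coefficient of the product term: delta_1 + gamma_1
```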
With any luck, multiplying those two "tiny errors" by typical values of the $x_i$ will either (a) produce something inconsequential compared to $\varepsilon$ or (b) produce terms that behave like random noise and, when added to $\varepsilon$ (perhaps after adjusting the constant term $\beta_0$ to absorb any systematic bias), can be treated as a single random error term.
In either case, with a change of notation we see that this linear-approximation-to-an-interaction model takes the form
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1 x_2 + \varepsilon,\tag{**}$$
which is precisely the usual "interaction" regression model. (Note that none of the new parameters, nor $\varepsilon$ itself, is the same quantity originally represented by those terms in $(*).$)
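Fitting $(**)$ is then routine. As an illustrative sketch (assuming the simulated `x1`, `x2`, `y` from the earlier snippet, plus pandas and statsmodels), the usual formula interface creates the product column automatically:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})  # data simulated above

# "x1 * x2" expands to x1 + x2 + x1:x2, i.e. exactly the interaction model (**).
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)  # estimates of beta_0, beta_1, beta_2, and beta_12 (= gamma_1 + delta_1)
```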
Observe how $\beta_{12}$ arises through variation in both the original parameters. It captures the combination of (i) how the coefficient of $x_1$ depends on $x_2$ (namely, through $\gamma_1$) and (ii) how the coefficient of $x_2$ depends on $x_1$ (through $\delta_1$).
Some Consequences
It is a consequence of this analysis that if we fix all but one of the regressors, then (conditionally) the response $Y$ is still a linear function of the remaining regressor. For instance, if we fix the value of $x_2,$ then we may rewrite the interaction model $(**)$ as
$$Y = (\beta_0 + \beta_2 x_2) + (\beta_1 + \beta_{12} x_2) x_1 + \varepsilon,$$
where the intercept is $\beta_0 + \beta_2 x_2$ and the slope (that is, the $x_1$ coefficient) is $\beta_1 + \beta_{12} x_2.$ This allows for easy description and insight. Geometrically, the surface given by the function
$$f(x_1,x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1x_2$$
is ruled: when we slice it with a plane on which either $x_1$ or $x_2$ is held constant, the result is always a line. (However, the surface itself is not planar except when $\beta_{12}=0;$ indeed, when $\beta_{12}\ne 0$ it everywhere has negative Gaussian curvature.)
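To read off this conditional linearity from a fitted model (a sketch, continuing with the hypothetical fit from the earlier snippet), one can evaluate the implied intercept and slope at a few fixed values of $x_2$:

```python
# Conditional intercept and slope in x1 at fixed values of x2,
# using the hypothetical fit from the earlier snippet.
b0_, b1_, b2_, b12_ = (fit.params["Intercept"], fit.params["x1"],
                       fit.params["x2"], fit.params["x1:x2"])

for x2_fixed in (0.0, 1.0, 2.0):
    intercept = b0_ + b2_ * x2_fixed   # beta_0 + beta_2 * x_2
    slope = b1_ + b12_ * x2_fixed      # beta_1 + beta_12 * x_2
    print(f"x2 = {x2_fixed:.1f}:  E[Y] = {intercept:.2f} + {slope:.2f} * x1")
```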
If our hope for (a) or (b) does not pan out, we might further expand the functional behavior of the original $\beta_i$ to include terms of second order or higher. Carrying out the same analysis shows this will introduce terms of the form $x_1^2,$ $x_2^2,$ $x_1x_2^2,$ $x_1^2x_2,$ and so forth into the model (as sketched below). In this sense, including a (product) interaction term is merely the first--and simplest--step towards modeling nonlinear relationships between the response and the regressors by means of polynomial functions.
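As a sketch of that next step (again with the hypothetical data from above, and patsy's `I()` notation for arithmetic inside a formula), a second-order expansion of the coefficient functions amounts to adding exactly those polynomial columns:

```python
# Quadratic coefficient functions introduce squared and mixed higher-order terms.
poly_fit = smf.ols(
    "y ~ x1 * x2 + I(x1**2) + I(x2**2) + I(x1**2 * x2) + I(x1 * x2**2)",
    data=df,
).fit()
print(poly_fit.params)
```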
Finally, in his textbook EDA (Addison-Wesley 1977), John Tukey showed how this approach can be carried out far more generally. After first "re-expressing" (that is, applying suitable non-linear transformations to) the regressors and the response, it often is the case that either model $(*)$ applies to the transformed variables or, if not, model $(**)$ can easily be fit (using a robust analysis of residuals). This allows for a huge variety of nonlinear relationships to be expressed and interpreted as conditionally linear responses.
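As a small illustration of re-expression (a sketch with hypothetical multiplicative data, not an example taken from EDA itself): taking logarithms of a power-law relationship $Y = \alpha\, x_1^{\theta_1} x_2^{\theta_2}\eta$ turns it into model $(*)$ on the transformed scale:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
u1 = rng.uniform(0.5, 3.0, n)
u2 = rng.uniform(0.5, 3.0, n)
eta = np.exp(rng.normal(0.0, 0.05, n))   # multiplicative error
yy = 2.0 * u1**1.5 * u2**-0.7 * eta      # hypothetical power-law "truth"

d = pd.DataFrame({"log_y": np.log(yy), "log_x1": np.log(u1), "log_x2": np.log(u2)})
# On the log scale the relationship is linear: model (*) applies to the re-expressed variables.
print(smf.ols("log_y ~ log_x1 + log_x2", data=d).fit().params)  # roughly (log 2, 1.5, -0.7)
```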