
Let's consider the regression $y=x_1+x_2+x_3+\varepsilon$.

It is known that $x_2$ and $x_3$ affect $x_1$, but $x_2$ and $x_3$ do not affect $y$. $x_1$ can affect $y$, but only to a small extent. The RMSE is slightly lower if we add $x_2$ and $x_3$ compared to when regressing $y=x_1+\varepsilon$.

There is no multicollinearity. Given that the goal is to estimate the effect of $x_1$ on $y$, what are the arguments for including or excluding $x_2$ and $x_3$ in the regression? Is one argument in favor of adding $x_2$ and $x_3$ that doing so lets us estimate the pure effect of $x_1$ on $y$?

whuber
Adam

1 Answer


The situation you are referring to is called mediation.

Whether you should include $x_2$ and $x_3$ depends on what you want from your model. If you want to test models of complex relationships (e.g., comparing partial mediation to complete mediation), you may want to use structural equation modeling (SEM). If you simply want to predict $y$ as well as possible, $x_2$ and $x_3$ may or may not help. (Note that adding them cannot increase the in-sample RMSE, and will almost always reduce it at least slightly, whether the variables are appropriate or not, so a slightly lower RMSE is not by itself evidence that they belong in the model.)
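To make the point about RMSE concrete, here is a minimal simulation sketch. The data-generating process, the coefficient values (0.5 and 0.1), the sample size, and the helper `fit_ols` are all assumptions chosen to mirror the question, not anything taken from the asker's data. In it, $x_2$ and $x_3$ drive $x_1$ but have no direct effect on $y$, and the in-sample RMSE still does not go up when they are added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data-generating process (an assumption for illustration):
# x2 and x3 affect x1, x1 has a small effect on y, x2 and x3 do not affect y.
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 + 0.5 * x3 + rng.normal(size=n)
y = 0.1 * x1 + rng.normal(size=n)


def fit_ols(y, *columns):
    """Fit OLS with an intercept; return (coefficients, in-sample RMSE)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, np.sqrt(np.mean(resid ** 2))


beta_small, rmse_small = fit_ols(y, x1)
beta_full, rmse_full = fit_ols(y, x1, x2, x3)

print(f"in-sample RMSE, x1 only:    {rmse_small:.4f}")
print(f"in-sample RMSE, x1, x2, x3: {rmse_full:.4f}")
print(f"x1 coefficient, x1 only:    {beta_small[1]:.3f}")
print(f"x1 coefficient, x1, x2, x3: {beta_full[1]:.3f}")
```

Because the smaller model is nested in the larger one, least squares can always reproduce the smaller fit by setting the extra coefficients to zero, which is why the in-sample RMSE cannot get worse. Out-of-sample error is a different matter and would need a held-out set or cross-validation to assess.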

Whether or not adding $x_2$ and $x_3$ helps you estimate the "pure" effect of $x_1$ depends on the nature of the data generating process. For example, they would 'purify' the relationship if they act as suppressors. Suppression is a difficult and counter-intuitive topic. To start learning about it, you could try reading this excellent CV thread: Suppression effect in regression: definition and visual explanation/depiction (this or this may help as well).
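As a rough illustration of suppression, the sketch below uses a toy setup that is entirely an assumption: $x_1$ is the sum of a component related to $y$ (called `signal` here) and an unrelated component $x_2$. In this setup $x_2$ is uncorrelated with $y$, yet including it changes the coefficient on $x_1$ substantially, because it absorbs the part of $x_1$ that has nothing to do with $y$. The helper `ols_coefs` is hypothetical as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Toy suppression setup (an assumption for illustration):
# x1 = signal + x2, and only the signal part is related to y.
signal = rng.normal(size=n)
x2 = rng.normal(size=n)
x1 = signal + x2
y = signal + 0.5 * rng.normal(size=n)


def ols_coefs(y, *columns):
    """Return OLS coefficients (intercept first) for y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b_alone = ols_coefs(y, x1)
b_both = ols_coefs(y, x1, x2)

print(f"corr(x2, y):                   {np.corrcoef(x2, y)[0, 1]:.3f}")
print(f"x1 coefficient, x1 only:       {b_alone[1]:.3f}")
print(f"x1 coefficient, x1 and x2:     {b_both[1]:.3f}")
print(f"x2 coefficient, x1 and x2:     {b_both[2]:.3f}")
```

Under these assumptions the coefficient on $x_1$ is about 0.5 when $x_1$ is used alone and about 1 once $x_2$ is included, while $x_2$ picks up a coefficient near $-1$. Whether real data behave this way depends on the actual data generating process, which a regression alone cannot reveal.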

gung - Reinstate Monica
  • Thank you very much, that was a very quick reply and a detailed answer. I was unfamiliar with the term mediation. Also, I am going to read through that link. – Adam Jul 21 '15 at 20:21