
My intuition was that if an explanatory variable is independent of the response then in a multiple regression it should have a $\beta$ of zero.

Consider, however, the following very simple example: the distribution of $\left(Y,X_1,X_2\right)$ is multivariate normal, with mean vector $\mathbf{0}$ and covariance matrix $$\begin{pmatrix}10&2&0\\2&5&1\\0&1&1/2\end{pmatrix}.$$ Here the regression coefficients are $$\boldsymbol{\beta}=\begin{pmatrix}2&0\end{pmatrix}\begin{pmatrix}5&1\\1&1/2\end{pmatrix}^{-1}=\begin{pmatrix}\frac{2}{3}&-\frac{4}{3}\end{pmatrix},$$ i.e., $X_2$ has a non-zero $\beta$ despite being uncorrelated with $Y$, which, since the joint distribution is multivariate normal, means it is independent of $Y$.
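
For what it's worth, the coefficient calculation is easy to check numerically; here is a minimal `R` sketch (the object names are only illustrative):

```r
# Population covariance matrix of (Y, X1, X2) from the example above
Sigma <- matrix(c(10, 2, 0,
                   2, 5, 1,
                   0, 1, 1/2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("y", "x1", "x2"), c("y", "x1", "x2")))

# beta = Sigma_{Y,X} %*% Sigma_{X,X}^{-1}
beta <- Sigma["y", c("x1", "x2")] %*% solve(Sigma[c("x1", "x2"), c("x1", "x2")])
beta  # approximately 0.667 and -1.333, i.e. (2/3, -4/3)
```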

How can I imagine this?

I understand that this is a multivariate situation, so pairwise correlations are not conclusive on their own (the multivariate structure matters), but I thought that for a multivariate normal distribution, a zero entry in the full covariance matrix (with all variables included in the regression) would force the corresponding $\beta$ to be zero.

Corollary question: if my intuition is not correct, is the following statement true instead: "In a multivariate normal model, a $\beta$ is zero iff the variable is uncorrelated with the response and also uncorrelated with all the remaining explanatory variables"?

That would be interesting, because it would mean that of the two conditions under which omitted variable bias does not occur (the omitted variable has a zero $\beta$, or it is uncorrelated with all the other explanatory variables), the first actually implies the second (in a multivariate normal model, of course).
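
To make the omitted variable bias point concrete with the numbers above (just applying the standard population formulas to this example): if $X_2$ is dropped, the coefficient of $X_1$ changes from $\beta_1 = 2/3$ to $$\frac{\operatorname{Cov}(Y,X_1)}{\operatorname{Var}(X_1)}=\frac{2}{5},$$ even though $X_2$ is uncorrelated with $Y$.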

Tamas Ferenci
  • A correct interpretation of a zero correlation coefficient is that the explanatory variable is uncorrelated with the *residuals* of the regression of the response on all the other explanatory variables. See https://stats.stackexchange.com/questions/46185 *inter alia.* – whuber Jun 10 '19 at 14:39
  • Hm. That doesn't seem to be the case! Or I misunderstood you; here is what I have tried: `library(MASS)` `SimData` ... – Tamas Ferenci Jun 12 '19 at 20:37
  • You are correct: I misstated the idea. Let me restate it in the context of your (helpful) `R` illustration. First, 0 is the result of `with(SimData, cov(y,x2))` (up to sampling error, as always). Second, -4/3 is the result of (a) taking the effect of `x1` out of all the variables: `res.2` ... – whuber Jun 12 '19 at 22:05
  • Thank you! I think I can now at least phrase what confuses me: that *removing* an effect *introduces* an effect. $X_2$ is independent of $Y$; it has no effect on $Y$ (as evidenced by the covariance matrix, or by a regression with $X_2$ alone as predictor). When you put $X_1$ into the regression as well, you remove its effect from $X_2$'s effect. And now comes the (false) intuition: if $X_2$'s effect is already nil, then removing anything from it can only leave it nil. I'm still struggling to understand that, either conceptually or graphically (which should also be possible here...). – Tamas Ferenci Jun 13 '19 at 06:43
  • Some efforts have been made in other threads to present both conceptual and graphical explanations: see https://stats.stackexchange.com/questions/17336 and https://stats.stackexchange.com/questions/46185 *inter alia.* – whuber Jun 13 '19 at 12:32
  • @whuber I now see that the heart of my misunderstanding was that I assumed that the covariance matrix contains the *direct* effects, while in reality it contains the *total* effects. It is entirely possible that the *direct* relationship between $X_2$ and $Y$ ($-4/3$) is exactly the opposite of the *indirect* relationship mediated through $X_1$ ($2 \cdot 2/3$), resulting in 0 *total* relationship (seen in the covariance matrix), but a non-zero effect when controlling for $X_1$ (see the numerical sketch after these comments). – Tamas Ferenci Feb 21 '20 at 09:40
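
Since the `R` code in the comments above got cut off, here is a minimal sketch of the kind of simulation being discussed (the names `SimData` and `res.2` are taken from the comments, but the code itself is a reconstruction, not the original):

```r
library(MASS)

set.seed(1)
Sigma <- matrix(c(10, 2, 0,
                   2, 5, 1,
                   0, 1, 1/2), nrow = 3, byrow = TRUE)
SimData <- as.data.frame(mvrnorm(n = 1e6, mu = c(0, 0, 0), Sigma = Sigma))
names(SimData) <- c("y", "x1", "x2")

# Marginally, x2 is (up to sampling error) uncorrelated with y:
with(SimData, cov(y, x2))               # approximately 0

# Yet in the multiple regression its coefficient is approximately -4/3:
coef(lm(y ~ x1 + x2, data = SimData))

# Frisch-Waugh-Lovell style residualisation: take the effect of x1 out of
# both y and x2, then regress residual on residual; the slope is again ~ -4/3.
res.y <- residuals(lm(y  ~ x1, data = SimData))
res.2 <- residuals(lm(x2 ~ x1, data = SimData))
coef(lm(res.y ~ res.2))

# The decomposition behind the zero marginal covariance:
# Cov(Y, X2) = beta1 * Cov(X1, X2) + beta2 * Var(X2)
#            = (2/3) * 1 + (-4/3) * (1/2) = 0,
# i.e. the contribution of the direct effect (-2/3) exactly cancels the
# contribution coming through X1 (+2/3).
```

Dividing that last identity by $\operatorname{Var}(X_2)=1/2$ gives exactly the $2\cdot 2/3$ versus $-4/3$ cancellation mentioned in the last comment.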

0 Answers