I have a basic issue that has long been bugging me and that I can't quite get my head around. It concerns shared variability in linear regression when two or more predictors are correlated. This is a sort of follow-up to this great thread: Where is the shared variance between all IVs in a linear multiple regression equation?

In a nutshell: We all know that under multicollinearity the variances of the coefficient estimates inflate and the estimates themselves are less reliable (they reflect both unique and shared variability, and we don't really know how these are divided). We also know that in linear regression each beta represents the contribution of that variable when all other variables are held fixed (i.e., "after controlling for all other variables"). How can these two facts coexist?

Let me elaborate on that:

  1. In linear regression, we extract the beta weights, which represent the contribution of each predictor x to predicting y.
  2. For each x, the beta of x should reflect the contribution of x when all other independent variables are fixed.
  3. Technically, this is equivalent to regressing all other independent variables out of both x and y, and then regressing the residuals of y on the residuals of x. An example with two independent variables: the beta of x1 in the model y~x1+x2 is identical to the beta in the model resid.y.x2~resid.x1.x2 (the residuals of y and of x1 after regressing x2 out of each); see the R sketch right after this list.
  4. When we regress {x2,...,xn} out of x1, we remove the contributions of {x2,...,xn} to x1. Hence, the residuals reflect x1 after its shared variability with all the other variables has been removed. An example with two independent variables: assume x1 and x2 are correlated to some extent, cor(x1,x2)!=0. We can get rid of the variability they share by regressing one on the other and taking the residuals.
  5. Hence, if both x1 and x2 are independent variables in a linear regression, the beta of each should reflect its unique variability, after the shared variability has been removed from each.
  6. New fact: when there is multicollinearity (i.e., two or more independent variables correlate), the shared variability of the data gets divided between the correlated variables in a way that is essentially arbitrary. The higher the correlation between x1 and x2, the larger the shared part, and the less we can trust the betas to reflect the unique contributions of x1 and x2 (the simulation sketch further below illustrates this).
  7. Points 5 and 6 seem to contradict each other.
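
To make points 3 and 4 concrete, here is a minimal R sketch (the simulated data and the names x1, x2, y are made up for illustration). It checks that the coefficient of x1 in the full model equals the slope from the residual-on-residual regression, and that the residual of x1 shares no variability with x2:

```r
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n)           # x1 and x2 are correlated
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

# Coefficient of x1 in the full model y ~ x1 + x2
coef(lm(y ~ x1 + x2))["x1"]

# Regress x2 out of both y and x1, then regress residual on residual
r.y.x2  <- resid(lm(y  ~ x2))
r.x1.x2 <- resid(lm(x1 ~ x2))
coef(lm(r.y.x2 ~ r.x1.x2))[2]       # same value as above (point 3)

# Point 4: the residual of x1 is uncorrelated with x2
cor(r.x1.x2, x2)                    # numerically zero
```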

I am sure that one of my assumptions or one of my logical steps is wrong. Naturally, the fit can only improve as we add variables to a model, so it can't be that the shared variability is simply "removed" from the model, which seems to contradict points 1-5. So I must have gotten something wrong (point 4 is my main suspect); the simulation below shows both facts side by side.
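
Here is a quick simulation sketch (again with made-up data) of what I mean: as the correlation between x1 and x2 grows, adding x2 still improves the fit over y ~ x1 alone, yet the standard error of the x1 coefficient inflates, so its estimated unique contribution becomes less trustworthy:

```r
set.seed(2)
compare <- function(rho, n = 200) {
  x2 <- rnorm(n)
  x1 <- rho * x2 + sqrt(1 - rho^2) * rnorm(n)   # cor(x1, x2) is roughly rho
  y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)
  full <- lm(y ~ x1 + x2)
  c(rho        = rho,
    R2.x1.only = summary(lm(y ~ x1))$r.squared,
    R2.full    = summary(full)$r.squared,                          # fit only improves
    se.beta.x1 = summary(full)$coefficients["x1", "Std. Error"])   # but the SE inflates
}
round(t(sapply(c(0, 0.5, 0.9, 0.99), compare)), 3)
```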

Where is the fallacy?

Galit
