Let us first distinguish between perfect multi-collinearity (the model matrix is not of full rank, so the usual matrix inversion fails; this is usually due to a misspecification of the predictors) and non-perfect multi-collinearity (some of the predictors are correlated, but without causing computational problems). This answer is about the second type, which occurs in almost every multivariable linear model, since the predictors usually have no reason to be uncorrelated.
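For contrast, here is a minimal sketch of the perfect case (the variables W, Z and Y0 are hypothetical, used only for this illustration): the model matrix is rank deficient, and R's lm() reacts by dropping the aliased column and reporting NA for its coefficient rather than failing outright.

set.seed(1)
W  <- rnorm(20)
Z  <- 2 * W              # Z is an exact linear function of W: perfect collinearity
Y0 <- W + rnorm(20)
coef(lm(Y0 ~ W + Z))     # coefficient of Z is NA: lm() drops the aliased column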
A simple example with strong multi-collinearity is a quadratic regression, where the only predictors are $X_1 = X$ and $X_2 = X^2$:
set.seed(60)
X1 <- abs(rnorm(60))
X2 <- X1^2
cor(X1,X2) # Result: 0.967
This example illustrates your questions/claims:
1. Multicollinearity doesn't affect the regression model as a whole.
Let's have a look at an example model:
Y <- 0.5*X1 + X2 + rnorm(60)
fit <- lm(Y~X1+X2)
summary(fit)
# Result
[...]
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3439     0.3214  -1.070    0.289
X1            1.3235     0.8323   1.590    0.117
X2            0.5861     0.3931   1.491    0.141

Residual standard error: 1.014 on 57 degrees of freedom
Multiple R-squared:  0.7147,  Adjusted R-squared:  0.7047
F-statistic: 71.39 on 2 and 57 DF,  p-value: 2.996e-16
Global statements about the model are just fine:
- R-Squared: $X$ explains about 71% of the variability of $Y$
- Global F-test: At the 5% level, there is really an association between $X$ and $Y$
- Predictions: For a person with $X$-value 2, the best guess for their $Y$-value is
$$
-0.3439 + 1.3235\cdot 2 + 0.5861 \cdot 2^2 = 4.6475
$$
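The same number can be obtained with predict(); note that newdata must contain both predictors, i.e. $X_2 = 2^2$ as well:

predict(fit, newdata = data.frame(X1 = 2, X2 = 2^2))  # about 4.65, matching the hand calculation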
2. But if we start looking at the effects of the individual $X$ variables on the explained variable, then we are going to have inaccurate estimates.
The estimates are accurate; that is not the problem. The problem lies in the standard interpretation of isolated effects, which holds all other predictors fixed. This is a strange thing to do when those other predictors are strongly correlated with the one being varied. In our example it is even plainly wrong to say "the average $Y$ value increases by 1.3235 if we increase $X_1$ by 1 and hold $X_2$ fixed", because $X_2 = X_1^2$ cannot stay fixed while $X_1$ changes.

Since we cannot interpret isolated effects descriptively, all inductive statements about them are useless as well: look at the t-tests in the output above. Both p-values are above the 5% level, although the global test of association gives a p-value far below 5%. The null hypothesis of such a t-test is "the effect of the predictor is zero" or, in other words, "the inclusion of this predictor does not increase the true R-squared in the population". Because $X_1$ and $X_2$ are almost perfectly correlated, the model has almost the same R-squared if we drop one of the two variables:
summary(lm(Y~X1))
# Gives
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.7033     0.2148  -3.274  0.00179 **
X1            2.5232     0.2151  11.733  < 2e-16 ***

Residual standard error: 1.025 on 58 degrees of freedom
Multiple R-squared:  0.7036,  Adjusted R-squared:  0.6985
F-statistic: 137.7 on 1 and 58 DF,  p-value: < 2.2e-16
This already illustrates the first part of the following statement:

> One other thing to keep in mind is that the tests on the individual coefficients each assume that all of the other predictors are in the model. In other words each predictor is not significant as long as all of the other predictors are in the model. There must be some interaction or interdependence between two or more of your predictors.
The last statement here is plainly wrong: correlation between predictors is not the same thing as an interaction. Our example contains no interaction term at all, yet the individual t-tests are non-significant simply because $X_1$ and $X_2$ carry almost the same information.
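To make the first (correct) part of the statement concrete: with a single dropped predictor, the partial F-test from a model comparison is equivalent to the t-test of that predictor in the full model, so the following comparison should reproduce the p-value of about 0.141 seen for $X_2$ above.

anova(lm(Y ~ X1), fit)  # partial F-test for adding X2 to a model that already contains X1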