13

Beta stability in linear regression with high multi-collinearity?

Let's say that in a linear regression, the variables $x_1$ and $x_2$ have high multicollinearity (their correlation is around 0.9).

We are concerned about the stability of the $\beta$ coefficients, so we have to treat the multicollinearity.

The textbook solution would be to just throw away one of the variables.

But we don't want to lose useful information by simply throwing away variables.

Any suggestions?

Luna

3 Answers

11

Well, there is one ad hoc method that I've used before. I'm not sure if this procedure has a name but it makes sense intuitively.

Suppose your goal is to fit the model

$$ Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i $$

where the two predictors - $X_i, Z_i$ - are highly correlated. As you've pointed out, using them both in the same model can do strange things to the coefficient estimates and $p$-values. An alternative is to fit the model

$$ Z_i = \alpha_0 + \alpha_1 X_i + \eta_i $$

Then the residual $\eta_i$ will be uncorrelated with $X_i$ and can, in some sense, be thought of as the part of $Z_i$ that is not subsumed by its linear relationship with $X_i$. Then, you can proceed to fit the model

$$ Y_i = \theta_0 + \theta_1 X_i + \theta_2 \eta_i + \nu_i $$

which will capture all of the effects of the first model (and will, indeed, have the exact same $R^2$ as the first model) but the predictors are no longer collinear.
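
As an illustration, here is a minimal R sketch of this residualization; the data and the variable names `x`, `z`, `y` are made up, not from the question:

```r
# Hypothetical data with two highly correlated predictors
set.seed(1)
n <- 200
x <- rnorm(n)
z <- 0.9 * x + rnorm(n, sd = 0.3)   # z is strongly correlated with x
y <- 1 + 2 * x + 1.5 * z + rnorm(n)

# Step 1: regress z on x (with an intercept) and keep the residuals
eta <- residuals(lm(z ~ x))

# Step 2: replace z by the residuals; x and eta are uncorrelated
fit_orig  <- lm(y ~ x + z)
fit_resid <- lm(y ~ x + eta)

cor(x, eta)                          # essentially zero
summary(fit_orig)$r.squared          # same R^2 ...
summary(fit_resid)$r.squared         # ... as the residualized model
```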

Edit: The OP has asked for an explanation of why the residuals do not automatically have a sample correlation of zero with the predictor when the intercept is omitted, as they do when the intercept is included. This is too long to post in the comments, so I've made an edit here. The derivation is not particularly enlightening (unfortunately I couldn't come up with a reasonable intuitive argument), but it does show what the OP requested:

When the intercept is omitted in simple linear regression, $\hat \beta = \frac{ \sum x_i y_i}{\sum x_i^2}$, so $e_i = y_i - x_i \frac{ \sum x_i y_i}{\sum x_i^2}$. The sample correlation between $x_i$ and $e_i$ is proportional to $$\overline{xe} - \overline{x}\overline{e}$$ where $\overline{\cdot}$ denotes the sample mean of the quantity under the bar. I'll now show this is not necessarily equal to zero.

First we have

$$\overline{xe} = \frac{1}{n} \left( \sum x_i y_i - \sum x_{i}^2 \cdot \frac{ \sum x_i y_i}{\sum x_i^2} \right) = \overline{xy} \left( 1 - \frac{ \sum x_{i}^2}{ \sum x_{i}^2 } \right) = 0$$

but

$$\overline{x} \overline{e} = \overline{x} \left( \overline{y} - \frac{\overline{x} \cdot \overline{xy}}{\overline{x^2}} \right) = \overline{x}\overline{y} - \frac{\overline{x}^2 \cdot \overline{xy}}{\overline{x^2}}$$

so in order for the $e_i$ and $x_i$ to have a sample correlation of exactly 0, we need $\overline{x}\overline{e}$ to be $0$. Unless $\overline{x} = 0$, that requires $\overline{e} = 0$, i.e. $$ \overline{y} = \frac{ \overline{x} \cdot \overline{xy}}{\overline{x^2}} $$

which does not hold in general for two arbitrary sets of data $x, y$.
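
For anyone who wants to see this numerically, here is a small R sketch with made-up data (not part of the derivation) confirming that the no-intercept residuals are orthogonal to $x$ but not uncorrelated with it:

```r
set.seed(2)
x <- rnorm(50, mean = 3)                # nonzero mean makes the effect visible
y <- 1 + 0.5 * x + rnorm(50)

e_with    <- residuals(lm(y ~ x))       # intercept included
e_without <- residuals(lm(y ~ x - 1))   # intercept omitted

cor(x, e_with)      # numerically zero
cor(x, e_without)   # generally nonzero

# Orthogonality to x itself still holds without the intercept:
sum(x * e_without)  # numerically zero
```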

Macro
  • This reminds me of [partial regression](http://en.wikipedia.org/wiki/Partial_regression_plot) plots. – Andy W Jul 17 '12 at 15:58
  • 3
    This sounds like an approximation to replacing $(X, Z)$ by their principal components. – whuber Jul 17 '12 at 16:15
  • @whuber, yes but I think this is simpler since all you need to know is linear regression for this to make sense and you still (like the PCs) have the nice property that $\hat \theta_1$ and $\hat \theta_2$ are not correlated, unlike if you'd entered $X_i,Z_i$ into the model directly. Perhaps I'm missing your point. – Macro Jul 17 '12 at 16:18
  • 3
    One thing I had in mind is that PCA generalizes easily to more than two variables. Another is that it treats $X$ and $Z$ symmetrically, whereas your proposal appears arbitrarily to single out one of these variables. Another thought was that PCA provides a disciplined way to reduce the number of variables (although one must be cautious about that, because a small principal component may be highly correlated with the dependent variable). – whuber Jul 17 '12 at 16:22
  • @whuber, I see your point now but all else held equal I guess I'd prefer this, particularly since the problem description restricted us to the convenient (but not implausible) world where there are only two collinear predictors :) My reason is that I've always found the inability to intuitively interpret PCs as predictors to be a drawback. In this case, you at least can interpret $\hat \theta_1$ in the usual way and can, with some qualification, interpret $\hat \theta_2$ as well. – Macro Jul 17 '12 at 16:24
  • Thank you Macro. In your regression fitting $Z_i$ onto $X_i$, why do you include the intercept? – Luna Jul 17 '12 at 18:32
  • @Luna, that is to guarantee that $\eta_i$ is uncorrelated with $X_i$ and so that $\eta_i$ can be interpreted as "the part of $Z_i$ not explained by its linear relationship with $X_i$" – Macro Jul 17 '12 at 18:33
  • @Macro: if you don't include the intercept $\alpha_0$ there, you can get the same results, right? – Luna Jul 17 '12 at 20:16
  • @Luna, no, I don't think so. If you exclude $\alpha_0$ from the model then $\eta_i$ and $X_i$ will have a non-zero correlation. The logic underlying my answer will only apply if you keep the intercept. – Macro Jul 17 '12 at 20:18
  • Why will there be a non-zero correlation? Thank you! – Luna Jul 17 '12 at 22:28
  • @Luna, I guess my response is - why wouldn't there be a non-zero correlation? In the case of regression **with** the intercept, the residuals are exactly orthogonal to the space spanned by $X$, therefore the sample correlation between the residuals and any predictor will be exactly 0 - see this figure: http://en.wikipedia.org/wiki/File:OLS_geometric_interpretation.svg but when you remove the intercept that is no longer true. – Macro Jul 17 '12 at 22:59
  • Hi Macro, if you throw in an intercept term, then the residuals are orthogonal to the space spanned by X and the 1's. If there is no intercept term, then the residuals are orthogonal to the space spanned by X alone. So my question is: why do you want to include the intercept term? Thank you! – Luna Jul 18 '12 at 15:47
  • @Luna, please see my edit. – Macro Jul 18 '12 at 17:13
  • @Macro: Regarding your edit, geometrically it's "easy" to see. The residuals are perpendicular to $(\mathbf 1, \mathbf x)$ when an intercept is included and $\mathbf x - \bar x \mathbf 1$ lies in the subspace generated by the aforementioned pair. If there is no intercept in the model then the residuals are no longer perpendicular to this subspace---the vector has been tilted relative to the subspace and so there is no longer a right angle between them. – cardinal Jul 18 '12 at 17:44
  • @cardinal, yes of course, I clumsily (i.e. imprecisely to the point of possibly being wrong) was trying to say that in a chat with Luna. While making this derivation, the thing that threw me was that $\sum x_i e_i = 0$ in the model with no intercept, meaning the residuals _are_ orthogonal to the space defined by $bx$ (**not** the space defined by $a+bx$). But, in that case, it's not equivalent to the correlation being zero - it took me a minute to see that. – Macro Jul 18 '12 at 17:48
  • 1
    Hi Macro, Thank you for the excellent proof. Yeah now I understand it. When we talk about sample correlation between x and residuals, it requires the intercept term to be included for the sample correlation to be 0. On the other hand, when we talk about orthogonality between x and residuals, it doesn't require the intercept term to be included, for the orthogonality to hold. – Luna Jul 18 '12 at 17:55
  • @Luna, yes, I guess what I learned is that you need to bear in mind what space you're referring to when you say that $x$ is orthogonal to something :) Glad I (and cardinal) could help. – Macro Jul 18 '12 at 18:00
  • Thanks Macro. Could you please shed some light on why this approach might be better than Ridge Regression proposed above? Thank you! – Luna Jul 19 '12 at 14:21
  • 1
    @Luna, I don't particularly disagree with using ridge regression - this was just what first occurred to me (I answered before that was suggested). One thing I can say is that ridge regression estimates are biased, so, in some sense, you're actually estimating a slightly different (shrunken) quantity than you are with ordinary regression, making the interpretation of the coefficients perhaps more challenging (as gung alludes to). Also, what I've described here only requires understanding of basic linear regression and may be more intuitively appealing to some. – Macro Jul 19 '12 at 14:35
11

You can try the ridge regression approach when the correlation matrix is close to singular (i.e., the variables have high correlations). It will provide you with a robust estimate of $\beta$.

The only question is how to choose the regularization parameter $\lambda$. It is not a simple problem; I suggest trying different values.
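
As a rough illustration (not an endorsement of any particular way of choosing $\lambda$), here is a minimal R sketch using `lm.ridge` from the MASS package on made-up data, picking $\lambda$ by the built-in GCV criterion:

```r
library(MASS)

# Hypothetical data with two collinear predictors
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.95 * x1 + rnorm(n, sd = 0.2)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Fit ridge regression over a grid of lambda values
lambdas <- seq(0, 10, by = 0.1)
fit <- lm.ridge(y ~ x1 + x2, lambda = lambdas)

# One common heuristic: pick the lambda minimising generalised cross-validation
best <- which.min(fit$GCV)
lambdas[best]        # selected lambda
coef(fit)[best, ]    # coefficients at the selected lambda
```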

Hope this helps!

Paul
  • 2
    Cross-validation is the usual thing to do to choose $\lambda$ ;-). – Néstor Jul 17 '12 at 16:03
  • 1
    indeed (+1 for answer and Néstor's comment), and if you perform the calculations in "canonical form" (using an eigendecomposition of $X^TX$), you can find the $\lambda$ minimising the leave-one-out cross-validation error by Newton's method very cheaply. – Dikran Marsupial Jul 17 '12 at 17:02
  • thanks a lot! Any tutorial/notes for how to do that including the cross-validation in R? – Luna Jul 17 '12 at 17:44
  • Check out chapter 3 in this book: http://www.stanford.edu/~hastie/local.ftp/Springer/ESLII_print5.pdf. The implementation of ridge regression is done in R by some of the authors (Google is your friend!). – Néstor Jul 17 '12 at 17:52
  • 2
    You can use the `lm.ridge` routine in the MASS package. If you pass it a range of values for $\lambda$, e.g., a call like `foo – jbowman Jul 17 '12 at 17:59
  • Thanks folks. Could you please shed some light on why ridge regression might be better than the other approaches (for example, Macro's approach below)? Thank you! – Luna Jul 19 '12 at 14:20
  • Via ridge, all coefficients of correlated variables will be shrunken, so the $\beta$ estimates should be stable. But should we remove any of the collinear variables? – avocado Jun 03 '14 at 02:39
5

I like both of the answers given thus far. Let me add a few things.

Another option is that you can also combine the variables. This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable. This would be a good approach when you believe they are two different measures of the same underlying construct. In that case, you have two measurements that are contaminated with error. The most likely true value for the variable you really care about is in between them, thus averaging them gives a more accurate estimate. You standardize them first to put them on the same scale, so that nominal issues don't contaminate the result (e.g., you wouldn't want to average several temperature measurements if some are Fahrenheit and some are Celsius). Of course, if they are already on the same scale (e.g., several highly-correlated public opinion polls), you can skip that step. If you think one of your variables might be more accurate than the other, you could do a weighted average (perhaps using the reciprocals of the measurement errors).
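
For concreteness, here is a small R sketch of this composite-variable idea with made-up data and hypothetical names (an unweighted average of the z-scores):

```r
# Hypothetical data: x1 and x2 are two noisy measures of the same construct
set.seed(4)
n     <- 150
truth <- rnorm(n)
x1    <- truth + rnorm(n, sd = 0.4)
x2    <- 10 + 5 * (truth + rnorm(n, sd = 0.4))   # same construct, different scale
y     <- 2 + 3 * truth + rnorm(n)

# Standardize both, average them, and fit the model on the composite only
composite <- as.numeric((scale(x1) + scale(x2)) / 2)
fit <- lm(y ~ composite)
summary(fit)
```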

If your variables are just different measures of the same construct, and are sufficiently highly correlated, you really could just throw one out without losing much information. As an example, I was actually in a situation once, where I wanted to use a covariate to absorb some of the error variance and boost power, but where I didn't care about that covariate--it wasn't germane substantively. I had several options available and they were all correlated with each other $r>.98$. I basically picked one at random and moved on, and it worked fine. I suspect I would have lost power burning two extra degrees of freedom if I had included the others as well by using some other strategy. Of course, I could have combined them, but why bother? However, this depends critically on the fact that your variables are correlated because they are two different versions of the same thing; if there's a different reason they are correlated, this could be totally inappropriate.

As that implies, I suggest you think about what lies behind your correlated variables. That is, you need a theory of why they're so highly correlated to do the best job of picking which strategy to use. In addition to different measures of the same latent variable, some other possibilities are a causal chain (i.e., $X_1\rightarrow X_2\rightarrow Y$) and more complicated situations in which your variables are the result of multiple causal forces, some of which are the same for both. Perhaps the most extreme case is that of a suppressor variable, which @whuber describes in his comment below. @Macro's suggestion, for instance, assumes that you are primarily interested in $X$ and wonder about the additional contribution of $Z$ after having accounted for $X$'s contribution. Thus, thinking about why your variables are correlated and what you want to know will help you decide which (i.e., $x_1$ or $x_2$) should be treated as $X$ and which $Z$. The key is to use theoretical insight to inform your choice.

I agree that ridge regression is arguably better, because it allows you to use the variables you had originally intended and is likely to yield betas that are very close to their true values (although they will be biased--see here or here for more information). Nonetheless, I think it also has two potential downsides: it is more complicated (requiring more statistical sophistication), and the resulting model is more difficult to interpret, in my opinion.

I gather that perhaps the ultimate approach would be to fit a structural equation model. That's because it would allow you to formulate the exact set of relationships you believe to be operative, including latent variables. However, I don't know SEM well enough to say anything about it here, other than to mention the possibility. (I also suspect it would be overkill in the situation you describe with just two covariates.)

gung - Reinstate Monica
  • 4
    Re the first point: Let vector $X_1$ have a range of values and let vector $e$ have small values completely uncorrelated with $X_1$ so that $X_2=X_1+e$ is highly correlated with $X_1$. Set $Y=e$. In the regression of $Y$ against either $X_1$ or $X_2$ you will see no significant or important results. In the regression of $Y$ against $X_1$ and $X_2$ you will get an *extremely* good fit, because $Y=X_2-X_1$. Thus, if you throw out either of $X_1$ or $X_2$, you will have lost essentially *all* information about $Y$. Whence, "highly correlated" does not mean "have equivalent information about $Y$". – whuber Jul 18 '12 at 20:09
  • Thanks a lot Gung! Q1. Why does this approach work: "This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable."? Q2. Why would ridge regression be better? Q3. Why would SEM be better? Anybody please shed some light on this? Thank you! – Luna Jul 19 '12 at 14:22
  • Hi Luna, glad to help. I'm actually going to re-edit this; @whuber was more right than I had initially realized. I'll try to put in more to help w/ your additional questions, but it'll take a lot, so it might be a while. We'll see how it goes. – gung - Reinstate Monica Jul 19 '12 at 14:32