7

There seems to be a bit of a catch-22 here: suppose I am doing linear regression, and I have 2 variables that are highly correlated. If I include both in my model, I will suffer from multicollinearity, but if I include only one, I will suffer from omitted variable bias?

Nick Cox
Maverick Meerkat
    Usually, you would not care about both of them simultaneously. Depending on the goal of your analysis (say, description vs. prediction), you would only care about one of them. For description, multicollinearity is just a fact to be mentioned, just one of the characteristics of the data. For prediction, omitted variable bias is largely irrelevant. – Richard Hardy Mar 14 '20 at 19:13
  • What do you think about my answer? Does it answer your question? If so, you may accept it by clicking on the tick mark to the left. Otherwise, you may ask for further clarification. This is [how Cross Validated works](https://stats.stackexchange.com/tour). – Richard Hardy May 02 '20 at 06:29
  • Even though it makes sense, I need to give it some deeper thought. So I won't be accepting it yet. – Maverick Meerkat May 02 '20 at 09:32
  • Sure. I have also given it more thought and have appended my answer. – Richard Hardy May 02 '20 at 10:39
  • Richard is right in the sense that your goal matters but it seems there is confusion about "prediction". Predictions and model performance are generally considered unaffected by multicollinearity. Multicollinearity is a bigger concern when you want to describe the relationships in sample estimated by the beta coefficients or make inferences on the true values/relationships of the betas. – LSC May 02 '20 at 10:46

2 Answers

5

Usually, you would not care about both of them simultaneously. Depending on the goal of your analysis (say, description vs. prediction vs. causal inference), you would care about at most one of them.

Description$\color{red}{^*}$
Multicollinearity (MC) is just a fact to be mentioned, just one of the characteristics of the data to report.
The notion of omitted variable bias (OVB) does not apply to descriptive modelling. (See the definition of OVB in the Wikipedia quote provided below.) In contrast to causal modelling, the causal notion of relevance of variables does not apply to description. You can freely choose the variables you are interested in describing probabilistically (e.g. in the form of a regression), and you evaluate your model w.r.t. the chosen set of variables, not w.r.t. the variables you did not choose.

Prediction
MC and OVB are largely irrelevant as you are not interested in model coefficients per se, only in predictions.
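To illustrate this point, here is a small simulation sketch (not from the original answer; NumPy, with illustrative variable names): under strong MC the individual coefficients are very noisy across repeated samples, yet the linear prediction they combine into is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # highly correlated with x1
X = np.column_stack([np.ones(n), x1, x2])

x_new = np.array([1.0, 0.5, 0.5])     # a point to predict at
coefs, preds = [], []
for _ in range(200):                  # repeated noise draws, fixed design
    y_sim = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    b = np.linalg.lstsq(X, y_sim, rcond=None)[0]
    coefs.append(b[1])                # slope on x1 alone
    preds.append(x_new @ b)           # the fitted linear combination

print(np.std(coefs))  # large: the individual slope is poorly pinned down
print(np.std(preds))  # small: the prediction is stable
```

The individual slope bounces around from sample to sample, but the combination used for prediction does not, which is exactly why MC is largely harmless for prediction.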

Causal modelling / causal inference
You may care about both MC and OVB at once when attempting to do causal inference. I will argue that you should actually worry about the OVB but not MC. OVB results from a faulty model, not from the characteristics of the underlying phenomenon. You can remedy it by changing the model. Meanwhile, imperfect MC can very well arise in a well specified model as a characteristic of the underlying phenomenon. Given the well specified model and the data that you have, there is no sound escape from MC. In that sense you should just acknowledge it and the resulting uncertainty in your parameter estimates and inference.

$\color{red}{^*}$I am not 100% sure about the definition of description / descriptive modelling. In this answer, I take description to constitute probabilistic modelling of data, e.g. joint, conditional and marginal distributions and their specific features. In contrast to causal modelling, description focuses on probabilistic but not causal relationships between variables.


Edit to respond to feedback by @LSC:

In defence of my statement that OVB is largely irrelevant for prediction, let us first see what OVB is. According to Wikipedia,

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to the estimated effects of the included variables. More specifically, OVB is the bias that appears in the estimates of parameters in a regression analysis, when the assumed specification is incorrect in that it omits an independent variable that is a determinant of the dependent variable and correlated with one or more of the included independent variables.

In prediction, we do not care about the estimated effects but rather about accurate predictions. Hence, my statement above should be obvious.
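The textbook OVB formula can be checked numerically. A quick sketch (my addition, not part of the original answer): when $x_2$ is omitted, the slope on $x_1$ is biased by $\beta_2 \operatorname{cov}(x_1,x_2)/\operatorname{var}(x_1)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated with x1
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full model recovers the true coefficients (2, 3)
b_full = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

# Omitting x2 biases the x1 slope by beta2 * cov(x1, x2) / var(x1)
b_short = np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0]
expected = 2.0 + 3.0 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(b_full)    # close to [2, 3]
print(b_short)   # close to 2 + 3 * 0.8 = 4.4, matching `expected`
```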

Regarding @LSC's statement that OVB will necessarily introduce bias into the estimation process and can screw with predictions:

  • This is tangential to my points because I did not discuss the effect of omitting a variable on prediction. I only discussed the relevance of omitted variable bias for prediction. The two are not the same.
  • I agree that omitting a variable does affect prediction under imperfect MC. While this would not be called OVB (see the Wikipedia quote above for what OVB typically means), this is a real issue. The question is, how important is that under MC? I will argue, not so much.
  • Under MC, the information set of all the regressors vs. the reduced set without one regressor are close. As a consequence, the loss of predictive accuracy from omitting a regressor is small, and the loss shrinks with the degree of MC. This should come as no surprise. We are routinely omitting regressors in predictive models so as to exploit the bias-variance trade-off.
  • Also, the linear prediction is unbiased w.r.t. the reduced information set, and as I mentioned above, that information set is close to the full information set under MC. The coefficient estimators are also predictively consistent; see "T-consistency vs P-consistency" for a related point.
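The claim in the bullets above can be sketched in a short simulation (my addition, with illustrative variable names): when two regressors are nearly collinear, dropping one costs almost nothing in out-of-sample accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)           # near-duplicate of x1
y = x1 + x2 + rng.normal(size=n)

train, test = slice(0, 1000), slice(1000, None)

def oos_rmse(X):
    """Fit on the training half, report RMSE on the held-out half."""
    b = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    resid = y[test] - X[test] @ b
    return np.sqrt(np.mean(resid ** 2))

X_full = np.column_stack([x1, x2])
X_reduced = x1[:, None]                        # omit x2

print(oos_rmse(X_full), oos_rmse(X_reduced))   # nearly identical
```

The omitted regressor carries almost no information beyond the included one, so the predictive loss is tiny, and it shrinks further as the correlation grows.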
Richard Hardy
  • "Prediction" as used by 99.9999% of statisticians would be a case where multicollinearity is generally irrelevant. "Inference" or even just describing the relationships estimated with beta coefficients is when multicollinearity matters more. Multicollinearity does not cause bias in the estimation process and therefore, prediction (predicted Y values or model performance) is almost always considered unaffected by multicollinearity. Omitting pertinent variables, though, causes misspecification and can introduce at least bias but possibly inconsistency into the estimation procedure. – LSC May 02 '20 at 10:43
  • @LSC, multicollinearity does not cause bias, but this applies equally to prediction and inference. Importantly, while multicollinearity causes high variance, some combinations of parameters have low variance, e.g. the linear combination that is used for prediction in linear regression. That is why multicollinearity does not matter much for prediction. Meanwhile, the high variance of individual parameters is a problem in inference, as the high uncertainty of the point estimates is undesirable. – Richard Hardy May 02 '20 at 10:53
  • @LSC, (continued) Meanwhile, omitted variables cause not only finite-sample, but also asymptotic bias (hence also inconsistency w.r.t. to the true parameter), except when the omitted variables are orthogonal to the space of the included regressors. – Richard Hardy May 02 '20 at 10:55
  • @LSC, you mentioned confusion regarding prediction in a comment above and stressed the 99%... bit, but I do not see any confusion. While none of us has specified the notion of prediction explicitly, we seem to agree well on the implications of multicollinearity w.r.t. it. – Richard Hardy May 02 '20 at 11:55
  • Why the downvote? I would appreciate some constructive feedback so that I can improve my answer. Thank you. – Richard Hardy May 02 '20 at 14:09
  • I'm not trying to be funny, but the feedback I gave is in my post. Prediction is unaffected because multicollinearity doesn't introduce bias and allows for the use of whatever variables are useful for prediction even if information is redundant. "Prediction" as you were using it seemed more imprecise and to imply interpreting betas or making inferences on betas which isn't really prediction in a statistical sense. My use of "confusion" was more of an oblique way to say I think the word is used wrong and this context warrants correct usage of the term. 1/n – LSC May 03 '20 at 02:02
  • 2/2 I also think your comment that for "For prediction, omitted variable bias is largely irrelevant as you are not interested in model's coefficients per se, only in predictions. " is wrong because OVB will necessarily introduce bias into the estimation process and can screw with predictions because the linear predictor is no longer unbiased. – LSC May 03 '20 at 02:04
  • @LSC, thank you for your elaboration. Let me try to explain my points in more detail. **(1)** I do not see how I might have implied prediction to deal with interpreting betas or making inferences on betas, because I specifically stated the opposite: *you are not interested in model's coefficients per se*. **(2)** I believe my statement *For prediction, omitted variable bias is largely irrelevant as you are not interested in model's coefficients per se, only in predictions* is also correct; I have elaborated on it by appending my post. – Richard Hardy May 03 '20 at 07:55
  • Maybe we're saying the same thing in different language, but I'm sure confused by your post. "For prediction, multicollinearity and omitted variable bias are largely irrelevant as you are not interested in model's coefficients per se, only in predictions." But you claim "This is tangential to my points because I did not discuss the effect of omitting a variable on prediction. I only discussed the relevance of omitted variable bias for prediction. The two are not the same." I think you should elaborate on how you are using these differently, because it's not clear to me here. 1/n – LSC May 03 '20 at 10:48
  • You definitely implied the effects of OVB on predictions by your verbiage, so please clarify your wording. If you're conceding OVB has an implication for prediction, then certainly it matters for prediction but to know how much is more challenging in any particular case because it depends on the degree of bias. This feels more like a word game to me, but so I think clarifying your point further would be helpful rather than just saying "these things aren't the same." 2/2 – LSC May 03 '20 at 10:51
  • And your original text was "For prediction, omitted variable bias is largely irrelevant as you are not interested in model's coefficients per se, only in predictions. " Which reads a lot like "biased estimation is irrelevant for predictions, so OVB doesn't matter if you're trying to get predictions." which would be not a great statement. This is mostly what I was commenting on at the origin of this discussion. – LSC May 03 '20 at 10:56
  • @LSC, Thank you for your comments! I have tried honestly to explain myself to a degree of sufficient detail. I think the best one can do now is carefully read what I have written and be extra careful about deriving implications. The ones you mention were not meant and do not follow from my statements. I have by now provided a definition of the OVB, and I have elaborated below on the tangential point which is not about OVB but about the effect of omitting variables on prediction. The confusion should clear up if one pays sufficient attention to definitions and my precise wording. – Richard Hardy May 03 '20 at 11:05
  • @RichardHardy I agree with your reply about the prediction side. However, I fear that your reply conflated the causal and descriptive roles. I wrote a related question here (https://stats.stackexchange.com/questions/464261/regression-causation-vs-prediction-vs-description) and would appreciate your reply there. – markowitz May 03 '20 at 13:51
  • OVB is more of an econometrics term. In traditional statistics, not many use "omitted variable bias" but talk about biased estimation and how omitting a relevant variable (nonzero beta and nonzero correlation with something else in the model) could cause this. I think your explanation of "this sounds the same but it's not" is lacking, but if you don't feel further explanation is needed, that's okay! I think we agreed on the important points of what MC or biased estimation can mean, just maybe some vocabulary differences. 1/2 – LSC May 03 '20 at 20:06
  • My point is that, to less technical readers, your point may be lost because they're asking the question in the first place and it may not necessarily follow what I have mentioned, but others can't necessarily pick that up, otherwise the question from OP wouldn't arise in the first place. 2/2 – LSC May 03 '20 at 20:07
1

If your goal is inference, multicollinearity is problematic. Consider multiple linear regression where the beta parameters help us estimate the increase or decrease in Y for a unit increase in X1, all other variables held constant. Multicollinearity has the effect of inflating the standard errors of the beta parameters, making such inferences less reliable. Specifically, the variances of the model coefficients become very large so that small changes in the data can precipitate erratic changes in model parameters.

If the purpose of the regression model is to investigate associations, multicollinearity among the predictor variables can obscure the computation and identification of key independent effects of collinear predictor variables on the outcome variable because of the overlapping information they share.

(source)

However, multicollinearity does not prevent good, reliable predictions in the scope of the model.

In general, multicollinearity is acceptable when the goal is prediction, but if multicollinearity is present, it is something you should disclose and it affects the uncertainty surrounding your model estimates.

Be aware that perfect multicollinearity actually leads to a situation in which an infinite number of coefficient vectors produce the same fitted regression. The VIF (variance inflation factor) is one rule of thumb for how much imperfect multicollinearity we can tolerate in inference.
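For concreteness, the VIF of a regressor is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that column on the others. A minimal sketch (my addition; `vif` is a hypothetical helper, not a library function):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        fitted = A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        r2 = 1 - np.sum((X[:, j] - fitted) ** 2) \
               / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # highly collinear with x1
x3 = rng.normal(size=500)              # independent
X = np.column_stack([x1, x2, x3])
print(vif(X))                          # large for x1 and x2, near 1 for x3
```

A common rule of thumb flags VIF values above 5 or 10 as problematic for inference.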

In a model with perfect multicollinearity, your regression coefficients are indeterminate and their standard errors are infinite

(source).
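The perfect-multicollinearity case can be seen directly in code (a sketch I am adding for illustration): the design matrix is rank-deficient, so the coefficients are not identified even though the fitted values are.

```python
import numpy as np

x1 = np.arange(10.0)
x2 = 2.0 * x1                      # perfectly collinear with x1
X = np.column_stack([x1, x2])

# The design matrix has rank 1, not 2: X'X is singular
print(np.linalg.matrix_rank(X))    # 1

# lstsq still returns *a* (minimum-norm) solution, but infinitely many
# coefficient pairs give the same fit: (b1 + 2c, b2 - c) for any c.
y = 3.0 * x1
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(X @ b, y))       # True: fit is unique, coefficients are not
```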

Timothy
  • Welcome to Cross Validated! Unfortunately, I think several of your statements are incorrect. Imperfect multicollinearity does not invalidate any assumptions. The coefficients do not lose interpretability or meaning either. The only thing that happens is that confidence bounds get wide. – Richard Hardy May 02 '20 at 10:58
  • @RichardHardy see my edit for sources. You are right that the confidence bounds get wide, and that is my meaning. The confidence bounds widen to infinity as the degree of multicollinearity increases, making coefficients unstable. Small changes in the data can cause coefficients to change erratically. – Timothy May 02 '20 at 11:12
  • Multicollinearity (mc) can be perfect or imperfect. The original post specifically discusses **imperfect** mc: *I have 2 variables that are highly correlated*. Thus my points. Now it is a little unclear which type of mc you are discussing because your description does not fit either type but is a mix of both. Perfect mc (which is irrelevant to the OP) yields unidentified and in a sense meaningless point estimates and undefined/infinite (not just large) standard errors. I think you could improve your answer by making the distinction specific and explicitly stating the case you are discussing. – Richard Hardy May 02 '20 at 11:51
  • @RichardHardy I agree I should clarify the distinction between perfect and imperfect mc. Thanks for your feedback. I will revise my answer accordingly. I wanted to bring light to this point from your comment: "the high variance of individual parameters is a problem in inference." I feel this is a key insight. But when you assert "For description, multicollinearity is just a fact to be mentioned" it might be helpful to clarify the thing about the variances, for those who are wondering why multicollinearity can be a bad thing. – Timothy May 02 '20 at 20:07