Let's assume that you have trained a model on a training dataset and want to predict some values in a test/holdout dataset. Multicollinearity in your training dataset should only reduce predictive performance on the test dataset if the covariance structure of the variables differs between the training and test datasets. If the covariance structure (and consequently the multicollinearity) is similar in both, then it poses no problem for prediction. Since a test dataset is typically a random subset of the full dataset, it's generally reasonable to assume that the covariance structure is the same. Therefore, multicollinearity is typically not an issue for this purpose.
Let's take a simple example. Suppose you want to predict the heights of a group of people from some other variables: weight, arm length, leg length, etc. Unsurprisingly, you find that these variables are all strongly correlated in your training dataset. But as long as arm lengths, leg lengths, weight, etc. are correlated the same way in both training and test datasets, you can go ahead and use them to predict heights in your test dataset successfully. If for some reason your test dataset has a different covariance structure (suppose it contains a bunch of basketball players with unusually long arms), then your predictions will not be good.
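To make this concrete, here is a minimal simulation sketch in Python with numpy (the `sample` helper, the latent "size" variable, and all of the numbers are made up for illustration, not taken from real data). A shared latent size drives both weight and arm length, so the two predictors are strongly collinear; the fitted model predicts height well on a holdout drawn the same way, and badly once arm length is shifted relative to everything else:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, arm_shift=0.0):
    # A latent "size" variable drives height, weight, and arm length,
    # which makes the two predictors strongly collinear.
    size = rng.normal(0, 1, n)
    weight = size + rng.normal(0, 0.1, n)
    arm = size + arm_shift + rng.normal(0, 0.1, n)
    height = 3 * size + rng.normal(0, 0.5, n)
    return np.column_stack([weight, arm]), height

# Fit ordinary least squares (with an intercept) on the training data.
X_train, y_train = sample(1000)
A_train = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Evaluate on (a) a holdout with the same covariance structure and
# (b) a "basketball player" holdout where arms are long relative to size.
for label, shift in [("same covariance structure", 0.0),
                     ("long-armed basketball players", 2.0)]:
    X_test, y_test = sample(1000, arm_shift=shift)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ beta
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))
    print(f"{label}: test RMSE = {rmse:.2f}")
```

Despite the severe collinearity, the first holdout is predicted about as well as the training data; only the shifted holdout shows a large error, which matches the argument above.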
As for why multicollinearity is not a problem for prediction but is a problem for inference: let's take the extreme case of 2 variables x1 and x2 that are perfectly correlated (r = 1; for concreteness, imagine x2 is an exact copy of x1). When each is used separately in its own regression to predict a variable y, both regressions return identical coefficient values - let's say the coefficient value is 3 in both cases.
When both x1 and x2 are used together in a multiple regression to predict y, there is now an infinite range of possible coefficient combinations that are equally valid: any pair of coefficients that sums to 3 fits the data exactly as well. For example, the coefficient for x1 can be 3 and the coefficient for x2 can be 0. The reverse is equally valid: the coefficient for x1 can be 0 and the coefficient for x2 can be 3. So is any split in between, such as 1.5 and 1.5.
This leads to massive uncertainty from the perspective of inference, because each individual parameter is poorly constrained. But importantly, despite the huge variation in the coefficients of x1 and x2 across this hypothetical set of models, all the models return identical predictions for y. So from the perspective of prediction, all these models are equivalent. If all you want to do is predict some new values, you can choose any of these models - assuming of course that x1 and x2 are still perfectly correlated in your test dataset.
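Here is a small numeric check of this point, again in Python/numpy with made-up numbers: when x2 is an exact copy of x1, the coefficient pairs (3, 0), (0, 3), and (1.5, 1.5) produce identical predictions and identical residual sums of squares, so the data cannot tell them apart:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 100)
x2 = x1.copy()                      # perfectly correlated with x1 (r = 1)
y = 3 * x1 + rng.normal(0, 0.1, 100)

X = np.column_stack([x1, x2])
for b1, b2 in [(3.0, 0.0), (0.0, 3.0), (1.5, 1.5)]:
    pred = X @ np.array([b1, b2])
    rss = np.sum((y - pred) ** 2)
    print(f"b1={b1}, b2={b2}: RSS = {rss:.4f}")  # the same for every pair

# OLS cannot pick between these models; numpy's lstsq breaks the tie by
# convention, returning the minimum-norm solution, which splits the
# effect evenly rather than because the data favor that split:
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                         # roughly [1.5, 1.5]
```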