
Say we have a binary classification problem with mostly categorical features. We use some non-linear model (e.g. XGBoost or Random Forests) to learn it.

  • Should one still be concerned about multi-collinearity? Why?
  • If the answer to the above is yes, how should one fight it, considering that one is using these types of non-linear models?
Josh

4 Answers


Multi-collinearity will not be a problem for certain models, such as random forests or decision trees. For example, if we have two identical columns, a decision tree / random forest will automatically "drop" one of them at each split, and the model will still work well.
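For a concrete illustration, here is a minimal sketch using scikit-learn and synthetic data (the data set and settings are made up for illustration; exact scores will vary):

```python
# Minimal sketch: an exactly duplicated column leaves a random forest's
# predictive performance essentially unchanged, because each split simply
# uses whichever of the identical columns it happens to consider.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # append an exact copy of column 0

X_tr, X_te, y_tr, y_te = train_test_split(X_dup, y, random_state=0)

rf_dup = RandomForestClassifier(n_estimators=200, random_state=0)
rf_dup.fit(X_tr, y_tr)
print("accuracy with duplicated column:", rf_dup.score(X_te, y_te))

rf_orig = RandomForestClassifier(n_estimators=200, random_state=0)
rf_orig.fit(X_tr[:, :5], y_tr)  # same split, duplicate removed
print("accuracy without duplicate:", rf_orig.score(X_te, y_te))
```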

In addition, regularization is a way to "fix" the multi-collinearity problem. My answer to Regularization methods for logistic regression gives details.
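As a toy illustration of the regularization point (a minimal sketch with made-up data; how the weight is shared between the two columns is arbitrary):

```python
# Sketch: L2 regularization keeps logistic regression stable under
# near-perfect collinearity; the penalty shrinks and shares the weight
# between the two almost-identical columns instead of blowing it up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=1e-3, size=500)  # near-duplicate of x1
X = np.column_stack([x1, x2])
y = (x1 + rng.normal(size=500) > 0).astype(int)

# C is the inverse regularization strength; smaller C = stronger penalty.
model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("coefficients:", model.coef_)  # weight split across x1 and x2
```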

Haitao Du

Late to the party, but here is my answer anyway, and it is "yes": one should always be concerned about collinearity, regardless of whether the model/method is linear or not, and regardless of whether the main task is prediction or classification.

Assume a number of linearly correlated covariates/features are present in the data set and that Random Forest is the method. Obviously, the random selection of features per node may pick only (or mostly) collinear features, which may result in a poor split, and this can happen repeatedly, negatively affecting the performance.

Now, the collinear features may be less informative of the outcome than the other (non-collinear) features, and as such they should be considered for elimination from the feature set anyway. However, assume that these features rank high in the 'feature importance' list produced by the RF; they would then be kept in the data set, unnecessarily increasing the dimensionality. So, in practice, as an exploratory step (out of many related ones), I would always check the pairwise association of the features, including linear correlation.
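A minimal sketch of such a check, assuming numeric features and an arbitrary 0.9 threshold (for the mostly categorical features in the question, one would use an association measure such as Cramér's V instead of Pearson correlation):

```python
# Sketch: flag feature pairs whose absolute Pearson correlation exceeds
# an (arbitrary) threshold, as a pre-modelling exploratory step.
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    corr = df.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.1, size=200),  # collinear with a
                   "c": rng.normal(size=200)})
print(high_corr_pairs(df))  # expect something like [('a', 'b', 0.99...)]
```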

dnqxt
  • I believe there are cases when multi-collinearity can be safely ignored; some of these cases are discussed here: https://statisticalhorizons.com/multicollinearity – Dr Nisha Arora Nov 26 '19 at 04:29
  1. Should one still be concerned about multi-collinearity? Why?

If the non-linear model is a tree-based model, you shouldn't consider it a serious issue. Different tree models handle it differently. For example, a random forest will keep both features (because it builds each tree independently and randomly selects features for every tree), yet this has no effect on prediction performance, even if you remove the redundant one. XGBoost, on the other hand, will choose one of them and use it until the last tree is built.
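A rough way to see the difference is to compare feature importances on data with a duplicated column. This is a minimal sketch assuming the xgboost Python package is installed; the exact numbers depend on versions and tie-breaking:

```python
# Sketch: with a duplicated column, a random forest tends to spread
# importance across both copies, while xgboost tends to concentrate it
# on whichever copy its greedy splits pick. The behaviour is
# implementation-dependent; treat this as illustrative, not definitive.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # assumes xgboost is installed

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # column 4 is an exact copy of column 0

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
xgb = XGBClassifier(n_estimators=200, random_state=0).fit(X_dup, y)

print("RF importances: ", rf.feature_importances_.round(3))
print("XGB importances:", xgb.feature_importances_.round(3))
```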

  2. If the answer to the above is yes, how should one fight it, considering that one is using these types of non-linear models?

It only matters for interpretation, so removing the highly correlated variables is suggested.
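A minimal sketch of such a removal step, with an arbitrary 0.9 cutoff that should be tuned to the problem:

```python
# Sketch: drop one column from each highly correlated pair, keeping the
# first-seen column. The threshold is arbitrary and should be justified.
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```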

wolfe
  • 571
  • 5
  • 10
-4

Multi-collinearity is always a possible problem. Variables that are predictors in the model will affect the prediction when they are linearly related (i.e., when collinearity is present).

Michael R. Chernick
  • Thanks, if (1) the focus is prediction performance (and not interpretability) and (2) the model is non-linear, would you mind elaborating on why this can still be a problem? (And how exactly would it manifest itself?) – Josh Mar 08 '17 at 20:23
  • These variables that are predictors in the model will affect the prediction when they are linearly related (i.e. collinearity is present). – Michael R. Chernick Mar 08 '17 at 20:26
  • Affect the prediction how, exactly? BTW, http://stats.stackexchange.com/a/138082/99274, put some links in your answer or face the wrath of the "been there, done that" crowd. – Carl Mar 08 '17 at 21:19
  • Since classification is so closely related to prediction, and prediction tends not to suffer from multicollinearity, it is important to support your contention that it's *always* a "possible problem," especially for the particular models mentioned in the question. What manner of problem would that be for classification, and why? – whuber Mar 08 '17 at 21:21
  • It is possible to make a wrong classification when two or more of the variables used for the classifier are correlated. – Michael R. Chernick Mar 08 '17 at 22:12
  • I'm pretty sure you're begging the question. Whuber asked why prediction suffers from multicollinearity, and you basically responded "prediction suffers from multicollinearity because prediction suffers from multicollinearity." – Matthew Drury Mar 08 '17 at 22:48