
I am looking to fit multiple models (Random Forest (RF), Classification Trees (CT), Logistic Regression (LR) and Neural Networks (NN)) to predict whether there will be avalanches on a given day (0 or 1 dependent variable) based on meteorological variables (independent variables). However, I create 100+ secondary features (e.g. Tmax_24h, Tmax_48h, Tmin_24h, Tmax_48h, Rain_24h-48h-72h, Snow 24h-48h-72h, etc.) from 3 raw meteorological variables (air temperature, precipitation and wind speed). Doing so introduces a lot of collinearity between my features. As I understand it, having many collinear features doesn't seem to be a problem for RF and CT, but it seems to be one for NN and LR (they cannot converge when high collinearity exists between features). So I have a few questions:

  1. Are RF and CT really unaffected by collinearity, so that I can give them all of my 100+ features and they will do fine?

  2. If the answer to question 1 is YES, should I still run feature elimination on my 100+ features before fitting an RF or CT, even though they can handle collinearity by themselves?

  3. Are LR and NN really affected by collinearity, so that I need to perform feature selection on the 100+ variables before running them?

  4. If the answer to question 3 is NO, why do I get a convergence error when I do so?

  5. If the answer to question 3 is YES, what is the best feature selection method for these 2 models (NN and LR)? What I have tried so far is LR backward elimination, but since LR is affected by collinearity, the fit does not converge because many features are collinear. I then thought about running recursive feature elimination with cross-validation (RFECV) on my 100+ features to reduce the size of my dataset based on a performance metric (say, going from 100+ to 20 features). But if I run RFECV with LR, I still get a convergence error (because it starts with all 100+ features). So, considering that RF isn't affected by collinearity, I am now thinking of running the feature selection with random forest (RFECV-RF) before fitting my NN or LR models. However, I am not sure whether it is legitimate to apply a feature selection based on RF and then fit the model with a NN or LR.
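To make the RFECV-RF idea in question 5 concrete, here is a minimal sketch of what I have in mind. The data are a synthetic stand-in generated with `make_classification` (not my avalanche data), and the feature counts, `step`, `cv`, `scoring` and `max_iter` values are arbitrary assumptions. The `StandardScaler` step is also an assumption on my part, not something I have validated on my data. The selector sits inside the pipeline so that the selection is redone within each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical): 20 features, of which
# 12 are linear combinations of 4 informative ones, to mimic collinearity.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=12,
    random_state=0,
)

# Step 1: recursive feature elimination with cross-validation,
# ranking features by random forest importances.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=0),
    step=5,                     # drop 5 features per elimination round
    cv=3,
    scoring="roc_auc",
    min_features_to_select=5,
)

# Step 2: fit LR on the retained features only (scaled first; an assumption).
model = make_pipeline(selector, StandardScaler(), LogisticRegression(max_iter=1000))

# Outer cross-validation: selection and LR are both refit inside each fold.
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")

# Refit on all data to see how many features the RF-based selector keeps.
model.fit(X, y)
n_kept = model[0].n_features_
print("AUC per fold:", scores)
print("features kept:", n_kept)
```

The same pipeline could in principle end with `MLPClassifier` instead of `LogisticRegression`; whether features ranked by RF importance are the right ones for LR or NN is exactly what I am unsure about.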

I am using the sklearn library in Python, so any references to programming libraries in Python would be appreciated as well. Thank you in advance!

Boocaj
  • Welcome to Cross Validated! Does this answer your question? [Ridge regression for multicollinearity and outliers](https://stats.stackexchange.com/questions/555145/ridge-regression-for-multicollinearity-and-outliers) – Dave Dec 17 '21 at 12:33
  • It is indeed interesting, but I already tried applying an L1 (LASSO) and an L2 (ridge) penalty with Logistic Regression [LR link] and I still got a convergence error (as if it wasn't enough). Moreover, the sklearn neural network tool does not seem to have a penalty hyperparameter like LR does [NN link]. [LR link]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html [NN link]: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html – Boocaj Dec 17 '21 at 12:53
