How to handle high multicollinearity for logistic regression and neural nets

Question

I want to fit several models (Random Forest (RF), Classification Trees (CT), Logistic Regression (LR) and Neural Networks (NN)) to predict whether there will be avalanches on a given day (binary 0/1 dependent variable) from meteorological variables (independent variables). However, I create 100+ secondary features (e.g. Tmax_24h, Tmax_48h, Tmin_24h, Tmin_48h, Rain_24h-48h-72h, Snow_24h-48h-72h, etc.) from 3 raw meteorological variables (air temperature, precipitation and wind speed). Doing so introduces a lot of collinearity between my features. As far as I understand, having many collinear features doesn't seem to be a problem for RF and CT, but it does seem to be one for NN and LR (they cannot converge when high collinearity exists between features). So I have a few questions:

1. Are RF and CT really unaffected by collinearity, so that I can give them all of my 100+ features and they will do well?
2. If the answer to question 1 is YES, should I still run feature elimination on my 100+ features before fitting an RF or CT, even though they can manage collinearity by themselves?
3. Are LR and NN really affected by collinearity, so that I need to perform feature selection on the 100+ variables before fitting them?
4. If the answer to question 3 is NO, why do I get a convergence error when I fit them on all the features?
5. If the answer to question 3 is YES, what is the best feature selection method for these two models (NN and LR)? So far I have tried LR backward elimination, but since LR is affected by collinearity, the fit fails to converge because many features are collinear. I then thought about running recursive feature elimination with cross-validation (RFECV) on my 100+ features to reduce the dataset based on a performance metric (say, going from 100+ to 20 features). But if I run RFECV with LR, I still get a convergence error (because it starts with all 100+ features). So, considering that RF isn't affected by collinearity, I am now thinking of running the feature selection with a random forest (RFECV-RF) before fitting my NN or LR models. However, I am not sure whether it is legitimate to select features with RF and then fit an NN or LR on them.
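
For concreteness, the RFECV-RF idea from the last question could be sketched with sklearn roughly as follows. This is only a minimal sketch under stated assumptions, not a recommendation: the data are synthetic (`make_classification` stands in for the meteorological features), and the settings (`n_estimators`, `step`, `cv`, `min_features_to_select`, ROC AUC scoring) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 30 features, 20 of which are linear
# combinations of the 5 informative ones (i.e. strongly collinear).
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5,
    n_redundant=20, random_state=0,
)

# Recursive feature elimination driven by random-forest importances.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,                     # drop 5 features per iteration
    cv=3,
    scoring="roc_auc",
    min_features_to_select=5,
)

# Selection happens inside the pipeline, so each outer CV fold
# re-selects features on its own training data only.
model = make_pipeline(selector, StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")

# Refit on all data to inspect how many features were kept.
model.fit(X, y)
n_selected = model.named_steps["rfecv"].n_features_
print("ROC AUC per fold:", scores)
print("Features kept:", n_selected)
```

Keeping the selector inside the pipeline means the cross-validated score also accounts for the selection step, rather than scoring LR on features that were chosen using the held-out data.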

I am using the sklearn library in Python, so any references to programming libraries would ideally be for Python as well. Thank you in advance!

Welcome to Cross Validated! Does this answer your question? [Ridge regression for multicollinearity and outliers](https://stats.stackexchange.com/questions/555145/ridge-regression-for-multicollinearity-and-outliers) — Dave, Dec 17 '21 at 12:33
It is indeed interesting, but I already tried applying an L1 (LASSO) and an L2 (ridge) penalty with [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and I still got a convergence error (as if the penalty wasn't enough). Moreover, sklearn's [Neural Network tool](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) does not seem to have a penalty hyperparameter like LR does. — Boocaj, Dec 17 '21 at 12:53
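
As a point of reference for the comment above, here is a minimal sketch of fitting both models with standardized inputs and a penalty. It assumes synthetic data and arbitrary values for `C`, `alpha`, `hidden_layer_sizes` and `max_iter`; whether this resolves the convergence error on the real avalanche data is not something the sketch can show. Note that `MLPClassifier` does expose an L2 regularization strength through its `alpha` parameter, and that `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse penalty strength.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with many redundant (collinear) features.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5,
    n_redundant=20, random_state=0,
)

# Scaling first usually helps the solvers; smaller C = stronger L2 penalty.
lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.1, max_iter=2000),
)

# alpha is the L2 regularization strength of the MLP.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), alpha=1e-2,
                  max_iter=1000, random_state=0),
)

lr.fit(X, y)
nn.fit(X, y)

lr_acc = lr.score(X, y)   # training accuracy, for illustration only
nn_acc = nn.score(X, y)
print("LR training accuracy:", lr_acc)
print("NN training accuracy:", nn_acc)
```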


0 Answers