I have 27 features and I am trying to predict a continuous target. When I calculated the Variance Inflation Factors (VIF), only 8 features were below 10; the rest range from 10 to 250, so I am facing a multicollinearity issue. My work has two aims:

1. Use ML regression algorithms to predict the target values.
2. Determine the importance of the features (i.e., interpret the ML models).

I have applied a variety of machine learning algorithms: Ridge, Lasso, Elastic Net, Random Forest Regressor, Gradient Boosting Regressor, and Multiple Linear Regression. Random Forest Regressor and Gradient Boosting Regressor show the best performance (lowest RMSE) while using only 10 of the 27 features, selected based on the feature importance results.

As I understand it, multicollinearity can be addressed with regularized regression models such as Lasso. When I applied Lasso, however, the evaluation results were not as good as those of Random Forest Regressor and Gradient Boosting Regressor, and none of the coefficients were driven to zero. Moreover, I want to analyse which features affect my target value, and I do not want to omit any features.

I was wondering if anyone could help me determine which of these algorithms would be good to use, and why?
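For reference, the VIF of feature j is 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the remaining features. A minimal numpy sketch of that computation, using synthetic data in place of the real 27 columns (all values here are illustrative):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress that column on all the
    others (with an intercept) and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Add a near-duplicate of column 0, mimicking two policies rolled out together.
X = np.column_stack([X, X[:, 0] + 0.1 * rng.normal(size=200)])
print(vif(X).round(1))  # columns 0 and 3 get large VIFs; columns 1 and 2 stay near 1
```

The same numbers come out of `statsmodels.stats.outliers_influence.variance_inflation_factor`; the manual version just makes the definition explicit.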
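On the point that none of the Lasso coefficients become zero: how many coefficients Lasso zeroes out depends heavily on the regularization strength `alpha`, and Lasso is scale-sensitive, so features should be standardized first. A hedged sketch on synthetic data (the `alpha` values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.2 * rng.normal(size=n)  # correlated pair, like co-implemented policies
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)  # standardize: Lasso penalizes all coefs equally
for alpha in (0.001, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(Xs, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of {p} coefficients are exactly zero")
```

With a very small `alpha` the fit is close to OLS and typically nothing is zeroed, which may be why no coefficients vanished in your run; larger values prune the noise features while keeping the strong ones.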
- Some of my features are mitigation policies implemented during the pandemic to control the spread of COVID-19. The country I am analysing implemented many policies at the same time, so these features are strongly positively correlated with each other (about 80%). If the Random Forest regression model assigns some of these features zero importance, I could remove them, but I am also curious which policy affected my target variable the most, and I am not sure how I should deal with multicollinearity in my case. – Negin Zarbakhsh Jul 21 '21 at 18:42
- Closely related: https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning/168631#168631 – Sycorax Jul 21 '21 at 18:43
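On the which-policy-mattered-most question from the comments: a model-agnostic alternative to coefficients or impurity-based importances is permutation importance, which measures how much the prediction error grows when a feature is shuffled. With strongly correlated features, shuffling a *group* together estimates their joint effect, sidestepping the problem that correlated policies can trade importance between each other. A minimal numpy sketch with a closed-form ridge fit on synthetic data (feature names, `alpha = 1.0`, and coefficients are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
p1 = rng.normal(size=n)                      # "policy 1"
p2 = p1 + 0.2 * rng.normal(size=n)           # "policy 2", ~98% correlated with p1
x3 = rng.normal(size=n)                      # an unrelated feature
X = np.column_stack([p1, p2, x3])
y = 2.0 * p1 + 1.0 * x3 + 0.5 * rng.normal(size=n)

# Ridge fit in closed form; alpha is a placeholder, not tuned.
A = np.column_stack([np.ones(n), X])
I = np.eye(A.shape[1]); I[0, 0] = 0.0        # leave the intercept unpenalized
beta = np.linalg.solve(A.T @ A + 1.0 * I, A.T @ y)
predict = lambda M: np.column_stack([np.ones(len(M)), M]) @ beta

def perm_importance(cols, n_repeats=30):
    """Mean increase in MSE when the columns in `cols` are shuffled together."""
    base = np.mean((y - predict(X)) ** 2)
    total = 0.0
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        Xp = X.copy()
        for j in cols:
            Xp[:, j] = X[idx, j]             # one shared permutation for the whole group
        total += np.mean((y - predict(Xp)) ** 2) - base
    return total / n_repeats

for group in ([0], [1], [2], [0, 1]):
    print(group, round(perm_importance(group), 3))
```

Permuting p1 or p2 alone can understate their shared signal, while permuting the pair as a group reports their combined contribution; scikit-learn's `sklearn.inspection.permutation_importance` provides the single-feature version out of the box, and no features have to be dropped from the model.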