LASSO or random forest (RF) for variable selection with highly correlated features in a relatively small dataset with many features?

Question

I have a cross-sectional dataset with around 1000 features and 5000 observations. Many of the features (none are categorical) are highly correlated (correlation above 0.85). I want to reduce the feature set before modelling. I know that LASSO can shrink the feature set, since it sets coefficients to zero depending on the penalty weight. However, in the presence of highly correlated features it can select an irrelevant one.

On the other hand, as far as I know, if I use RF (with H2O), the effect of correlated features is diluted. In sklearn this is not an issue, as explained here. However, RF results are unstable since my data are quite noisy (i.e. every run, without changing anything, results in a different feature set).

Considering that LASSO gives stable results for the same dataset, I am planning to first use it to shrink the feature set (from 1000 to 100) and then apply RF to get the variable importance.
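To make the plan concrete, here is a minimal sketch of the two-stage idea using scikit-learn. It assumes a regression target and uses synthetic data from `make_regression` as a stand-in for the real dataset; the sizes, the cross-validated penalty (`LassoCV`), and the number of trees are illustrative choices, not part of the original setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: continuous target).
X, y = make_regression(n_samples=300, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Stage 1: LASSO on standardized features, since the penalty is scale-sensitive.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
kept = np.flatnonzero(lasso.coef_ != 0)  # indices of features LASSO retained

# Stage 2: random forest importance on the retained features only.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[:, kept], y)
order = np.argsort(rf.feature_importances_)[::-1]
ranked = kept[order]  # original column indices, most important first

print(f"LASSO kept {len(kept)} of {X.shape[1]} features")
print("Top features by RF importance:", ranked[:10])
```

With `LassoCV` the number of surviving features is determined by the cross-validated penalty; to hit a fixed target such as 100, one would instead increase the penalty of a plain `Lasso` until that many coefficients remain.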

Does this approach make sense? If not, what would you suggest? Lastly, I don't want to apply PCA since I need interpretation of variable importance.

[tag:Boruta] is a principled way to use random forest to screen out irrelevant predictors. — Sycorax, Apr 04 '19 at 15:50

score 1 · Accepted Answer · answered Apr 04 '19 at 15:45

Your approach should work, but it's a little complicated. An alternative I've used before is to train RFs iteratively, each time dropping the feature with the lowest importance. Track performance to make sure it doesn't drop very much, and repeat this process until you have an acceptable number of features.
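A minimal sketch of that loop, assuming scikit-learn, a regression target, and synthetic data; the helper name `backward_eliminate` and all sizes are hypothetical and only for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

def backward_eliminate(X, y, n_keep, n_trees=50, seed=0):
    """Drop the least important feature one at a time, tracking CV score."""
    remaining = list(range(X.shape[1]))
    history = []  # (number of features, mean cross-validated R^2)
    while True:
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        score = cross_val_score(rf, X[:, remaining], y, cv=3).mean()
        history.append((len(remaining), score))
        if len(remaining) <= n_keep:
            break
        # Refit on all rows to get importances, then drop the weakest feature.
        rf.fit(X[:, remaining], y)
        worst = int(np.argmin(rf.feature_importances_))
        del remaining[worst]
    return remaining, history

selected, history = backward_eliminate(X, y, n_keep=8)
for n_features, score in history:
    print(f"{n_features:2d} features: mean CV R^2 = {score:.3f}")
print("Selected columns:", selected)
```

The `history` list is what you would inspect to decide where performance starts to degrade. With 1000 features, dropping several features per iteration would cut the run time considerably; scikit-learn's `RFE` class offers a similar procedure with a `step` parameter.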

You state that the RF results are unstable. To increase stability you can increase the number of trees (try 2,500 or more). Also, when you say unstable, is it just the ranks that are jumping around, or the importance values themselves? For instance, if the importance values of your worst 300 features all fall in a very small range, then using their rank as a metric for RF stability wouldn't be very useful. It would be better to measure the correlation between importance scores from one run to the next.
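One way to check this, sketched with scikit-learn and SciPy on synthetic data (the data, tree counts, and seeds are assumptions for illustration): fit the forest twice with different seeds and compare the two importance vectors, both by value (Pearson) and by rank (Spearman).

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic, fairly noisy stand-in for the real data (assumption).
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

def importances(n_trees, seed):
    """Fit a forest and return its impurity-based feature importances."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    return rf.fit(X, y).feature_importances_

stability = {}
for n_trees in (10, 200):
    a = importances(n_trees, seed=1)
    b = importances(n_trees, seed=2)
    stability[n_trees] = {
        "pearson": pearsonr(a, b)[0],    # agreement of importance values
        "spearman": spearmanr(a, b)[0],  # agreement of ranks
    }
    print(n_trees, stability[n_trees])
```

If the Pearson correlation is high while the Spearman correlation is low, the instability is likely confined to near-tied, low-importance features.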

If you follow the above approach then you will still end up with some highly correlated features, but it seems like this might be ideal for you since you state that LASSO sometimes chooses the 'irrelevant' one.

score 1 · Answer 2 · answered Apr 05 '19 at 05:02

One option is to use variable clustering to establish groups of variables from which you can select a representative. This is a sort of compromise version of PCA: you get oblique components that are still somewhat correlated (but only weakly), and retain the original variables so that your model is interpretable. In my experience there are usually good reasons based on domain knowledge to select a particular variable from those groups.

You mention sklearn in your question so I guess you’re using Python, but I don’t know what software libraries are available in Python for this. In R there’s Hmisc::varclus().
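As a rough Python analogue (an assumption on my part, not a port of `varclus`), one can run hierarchical clustering from SciPy on a dissimilarity of 1 minus the squared Spearman correlation and then pick one variable per cluster. The synthetic data, the 0.5 cut-off, and the "first member" rule for choosing a representative are all illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Synthetic data (assumption): 9 features, three noisy copies of each of
# 3 independent latent variables, so columns 0-2, 3-5, 6-8 form groups.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.repeat(base, 3, axis=1) + 0.2 * rng.normal(size=(500, 9))

# Dissimilarity between variables: 1 - squared Spearman correlation.
rho, _ = spearmanr(X)             # 9 x 9 correlation matrix
dist = 1.0 - rho ** 2
dist = (dist + dist.T) / 2.0      # enforce exact symmetry
np.fill_diagonal(dist, 0.0)

# Complete-linkage clustering, cut at a dissimilarity of 0.5.
Z = linkage(squareform(dist, checks=False), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")

# Placeholder rule: take the first variable in each cluster. In practice
# domain knowledge should drive this choice.
representatives = sorted(int(np.flatnonzero(labels == c)[0])
                         for c in np.unique(labels))
print("Cluster labels:", labels)
print("Representatives:", representatives)
```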

score 0 · Answer 3 · answered Apr 04 '19 at 16:01

I think there are some issues with using LASSO or random forest for feature selection. My main concern is that if you use one of these methods for feature selection and then feed the selected features to an entirely different algorithm, you could run into problems. Random forest feature importance tells you which features were important for that random forest; it doesn't necessarily tell you which features are important for, say, a nearest-neighbor algorithm. The same applies to LASSO.
