My dataset has ~100k samples and 3,000 dimensions. The data are counts between 0 and 8, and the matrix is pretty sparse.
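For concreteness, here's a synthetic stand-in with the same shape and value range (the names and the density are placeholders, not my real data):

```python
# Synthetic stand-in: ~100k rows x 3000 columns of sparse integer counts in 0-8.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(100_000, 3_000, density=0.05, format="csr", random_state=0,
              data_rvs=lambda n: rng.integers(1, 9, size=n)).astype(np.int8)
y = rng.integers(0, 2, size=X.shape[0])  # binary target, just for the examples below
```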
Because of the curse of dimensionality, I want to reduce the number of variables before fitting a tree-based model (random forest or xgboost).
Some of the approaches I'm considering are:
1. Run a random forest, select the top K variables by importance, then run a random forest again on just those variables (see the sketch after this list).
2. Same as above, but with xgboost.
3. Run PCA to reduce the dimension, then run random forest or xgboost on the components.
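For options 1 and 2, this is roughly what I mean (a scikit-learn sketch; K, n_estimators, and the other settings are arbitrary placeholders, and xgboost's XGBClassifier would slot into the same pattern):

```python
# Option 1/2 sketch: fit once, keep the top-K features by impurity importance,
# then refit on just those columns. K = 200 is an arbitrary placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

K = 200
selector = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
selector.fit(X, y)
top_k = np.argsort(selector.feature_importances_)[-K:]  # indices of the top-K features

final_model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
final_model.fit(X[:, top_k], y)  # refit on the selected variables only
```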
Are these approaches appropriate? Any other suggestions? BTW, does PCA make sense on my sparse count data at all? I feel it's not appropriate to standardize sparse count data, which PCA usually requires.
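For option 3, this is what I'd run; I used TruncatedSVD here since it's the PCA variant scikit-learn offers for sparse input (it skips the centering step), but that's exactly the part I'm unsure about:

```python
# Option 3 sketch: project onto the top components, then fit the tree model on
# the dense scores. TruncatedSVD accepts a sparse matrix because it does not
# center the data; n_components = 100 is an arbitrary placeholder.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_samples, 100)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_reduced, y)
```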
Thanks a lot for any advice!