My dataset has ~100k samples and 3,000 dimensions. The data are counts between 0 and 8, and the matrix is pretty sparse.
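For concreteness, here's a synthetic stand-in with the same shape and value range (the names and the density are placeholders, not my real data):

```python
# Synthetic stand-in: ~100k rows x 3000 columns of sparse integer counts in 0-8.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(100_000, 3_000, density=0.05, format="csr", random_state=0,
              data_rvs=lambda n: rng.integers(1, 9, size=n)).astype(np.int8)
y = rng.integers(0, 2, size=X.shape[0])  # binary target, just for the examples below
```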
Because of the curse of dimensionality, I want to reduce the number of variables before fitting a tree-based model (random forest or xgboost).
Some of the approaches I'm considering are:
1. Run a random forest, select the top K variables by importance, then run a random forest again on just those variables (see the sketch after this list).
2. Same as above, but with xgboost.
3. Run PCA to reduce the dimension, then run random forest or xgboost on the components.
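For options 1 and 2, this is roughly what I mean (a scikit-learn sketch; K, n_estimators, and the other settings are arbitrary placeholders, and xgboost's XGBClassifier would slot into the same pattern):

```python
# Option 1/2 sketch: fit once, keep the top-K features by impurity importance,
# then refit on just those columns. K = 200 is an arbitrary placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

K = 200
selector = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
selector.fit(X, y)
top_k = np.argsort(selector.feature_importances_)[-K:]  # indices of the top-K features

final_model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
final_model.fit(X[:, top_k], y)  # refit on the selected variables only
```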
Are these approaches appropriate? Any other suggestions? BTW, does PCA make sense on my sparse count data at all? I feel it's not appropriate to standardize sparse count data, which PCA usually requires.
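For option 3, this is what I'd run; I used TruncatedSVD here since it's the PCA variant scikit-learn offers for sparse input (it skips the centering step), but that's exactly the part I'm unsure about:

```python
# Option 3 sketch: project onto the top components, then fit the tree model on
# the dense scores. TruncatedSVD accepts a sparse matrix because it does not
# center the data; n_components = 100 is an arbitrary placeholder.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_samples, 100)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_reduced, y)
```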
Thanks a lot for any advice!