I would like to know if there is a common recipe for when you have a dataset with a lot of variables. I have read about PCA, ICA and feature selection, but I'm not sure what I should try first and how to mix those techniques, for example with logistic regression.

What is the most recommended approach?

iamdeit

1 Answer

I would suggest first thinking about what your goal is in "dealing with high-dimensional data". For example:

  • Are you trying to reduce the computational complexity?
  • Are you trying to have better model interpretability (with some sacrifice on accuracy)?

Many models handle "feature selection" automatically, i.e., you can feed them high-dimensional data directly and the model will not complain. Examples include neural networks and random forests. The downside of these models is their low interpretability. If you care more about interpretability than accuracy, you can use a decision tree or logistic regression with LASSO (L1) regularization.

You mentioned logistic regression; I would recommend trying a regularized version first. Details can be found in my answer here:

Regularization methods for logistic regression
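
To make the regularized route concrete, here is a minimal sketch of L1-penalized (LASSO) logistic regression. It assumes scikit-learn, which the question does not mention, and uses a synthetic dataset as a stand-in for real data; the value `C=0.1` is an arbitrary illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset (assumption): 200 features, 10 informative.
X, y = make_classification(
    n_samples=500, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)

# Standardize first so the penalty treats all features on the same scale.
# The L1 penalty pushes many coefficients to exactly zero.
# C is the inverse regularization strength; 0.1 is an arbitrary choice.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso_logit.fit(X, y)

# Features with a nonzero coefficient are the ones the model kept.
coef = lasso_logit.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coef)
print(f"{selected.size} of {coef.size} features kept")
```

In practice `C` would usually be tuned by cross-validation (for example with `LogisticRegressionCV`) rather than fixed by hand.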

You also mentioned PCA. It is OK (but not recommended) to run PCA first and then fit the regression model on the components, as in principal component regression (PCR).
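
For comparison, a PCA-then-regression pipeline might look like the sketch below. It again assumes scikit-learn and synthetic data, and the choice of 20 components is arbitrary. One thing to keep in mind with this route is that PCA picks its components without looking at the labels, so the directions it keeps are not necessarily the ones that help the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset (assumption): 200 features, 10 informative.
X, y = make_classification(
    n_samples=500, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)

# Scale, project onto the first 20 principal components (arbitrary choice),
# then fit an ordinary logistic regression on the projected data.
pca_logit = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)

# Putting PCA inside the pipeline means it is refit on each training fold,
# so the held-out fold never influences the components.
scores = cross_val_score(pca_logit, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Running the same cross-validation on a regularized logistic regression without PCA gives a direct way to check which approach works better on a given dataset.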

Haitao Du