How to deal with high dimensional data (binary classification)?

Question

I would like to know if there is a common recipe for when you have a dataset with a lot of variables. I have read about PCA, ICA and feature selection, but I'm not sure what I should try first and how to mix those techniques, for example with logistic regression.

What is the most recommended approach?

David Dunson's work with tensors and massive categorical information may be of use to you. — Mike Hunter, Jan 16 '17 at 17:48

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

I would suggest thinking about what your goal is in "dealing with the high dimensional data". For example:

Are you trying to reduce the computational complexity?
Are you trying to have better model interpretability (with some sacrifice on accuracy)?

Many models handle "feature selection" automatically, i.e., you can feed them high dimensional data directly and the model will not complain about it. These models include neural networks and random forests. The downside of these models is that they have low interpretability. If you care more about interpretability than accuracy, a decision tree or LASSO-regularized logistic regression can be used.
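As a rough illustration of feeding wide data straight into such a model, here is a minimal sketch using scikit-learn. The synthetic dataset from `make_classification`, its sizes, and the forest settings are all assumptions made for the example, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a wide dataset: 500 features, only 10 informative.
X, y = make_classification(
    n_samples=400, n_features=500, n_informative=10,
    n_redundant=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No explicit feature selection step: the forest sees all 500 columns.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
rf_accuracy = forest.score(X_test, y_test)

# Impurity-based importances give a rough ranking of the columns.
top_features = forest.feature_importances_.argsort()[::-1][:10]

print("held-out accuracy:", rf_accuracy)
print("indices of the 10 highest-importance features:", top_features)
```

The importance ranking is only a coarse summary, which is part of why these models are considered harder to interpret than a sparse linear model.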

You mentioned logistic regression; I would recommend trying a regularized version first. Details can be found in my answer here:

Regularization methods for logistic regression
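A regularized fit could look like the sketch below, assuming scikit-learn and an L1 (LASSO) penalty. The synthetic data and the value `C=0.1` are placeholders chosen for illustration; in practice the penalty strength would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical wide dataset: 500 features, only 10 informative.
X, y = make_classification(
    n_samples=400, n_features=500, n_informative=10,
    n_redundant=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize first so the penalty treats all features on the same scale.
# Smaller C means stronger regularization; 0.1 is an arbitrary choice here.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso_logit.fit(X_train, y_train)

coef = lasso_logit.named_steps["logisticregression"].coef_
n_kept = int(np.count_nonzero(coef))
lasso_accuracy = lasso_logit.score(X_test, y_test)

print("features with a non-zero coefficient:", n_kept, "of", coef.shape[1])
print("held-out accuracy:", lasso_accuracy)
```

The L1 penalty drives many coefficients to exactly zero, so the fitted model doubles as a feature selector while staying readable as an ordinary logistic regression.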

You also mentioned PCA. It is OK (but not recommended) to run PCA first and then fit the regression model on the components, as in principal component regression (PCR).
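For completeness, the PCA-then-regression route can be sketched as a scikit-learn pipeline. The synthetic data and the choice of 20 components are assumptions for the example. Note that PCA chooses its components without looking at the labels, so the retained directions are not guaranteed to be the ones that separate the classes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical wide dataset: 500 features, only 10 informative.
X, y = make_classification(
    n_samples=400, n_features=500, n_informative=10,
    n_redundant=0, random_state=0,
)

# Scale, project onto 20 principal components (arbitrary choice),
# then fit a plain logistic regression on the component scores.
pca_logit = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)

# Keeping PCA inside the pipeline means it is refit on each training fold,
# so the held-out fold never influences the projection.
scores = cross_val_score(pca_logit, X, y, cv=5)
print("5-fold accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Running the same cross-validation on a regularized logistic regression gives a direct way to compare the two approaches on your own data.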

answered Jan 16 '17 at 17:45 by Haitao Du; edited Apr 13 '17 at 12:44 by Community

I'm interested in both of those goals, it would be helpful for the community to know the different approaches. – iamdeit Jan 16 '17 at 17:48
@diugalde that question is too broad to answer here. – Haitao Du Jan 16 '17 at 17:50
I'm new to this site, where should I post it? – iamdeit Jan 16 '17 at 17:51
@diugalde I do not know where you should post it. This is a Q&A site; the more specific the question is, the better other people can help. If you want a comprehensive overview, maybe read some books? – Haitao Du Jan 16 '17 at 17:53
@HaitaoDu Thanks, could you explain why PCA is not recommended? – haneulkim May 07 '21 at 23:04
