
I am trying to do correlation-based feature selection for a classification model. Dataset details are given below.

Training: 38 samples, 7130 features; represented as T

Testing: 34 samples, 7130 features; represented as S

Target: 2 classes (Yes | No)

So I need to select the 100 features most highly correlated with the class variable.

Here I have listed the different approaches that I've tried, but I am not sure which approach is best. Please go through the approaches given below and comment on the best one.

1) Combine T and S into a single table X = T + S. Let {A} be the set of all features and a an element of {A}. I calculated the correlation of every a with the class variable and then selected the top 100 features to create a new dataset of dimension 72x100.

2) I applied correlation selection on T only. The same selected features are then extracted from S. We get new datasets T` and S`.
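If it helps to make approach 2 concrete, here is a minimal sketch in Python. The random arrays are stand-ins for the real T and S; for a binary 0/1 class the point-biserial correlation is just the Pearson correlation of each feature with the label vector:

```python
import numpy as np

# Hypothetical stand-ins for the real data: T is 38x7130, S is 34x7130,
# y_T holds the binary (0/1) class labels for the training samples.
rng = np.random.default_rng(0)
T = rng.normal(size=(38, 7130))
S = rng.normal(size=(34, 7130))
y_T = rng.integers(0, 2, size=38)

# Pearson correlation of every feature column with the 0/1 label vector.
y_centered = y_T - y_T.mean()
T_centered = T - T.mean(axis=0)
corr = (T_centered * y_centered[:, None]).sum(axis=0) / (
    np.sqrt((T_centered ** 2).sum(axis=0)) * np.sqrt((y_centered ** 2).sum())
)

# Rank by absolute correlation and keep the top 100 features.
top100 = np.argsort(np.abs(corr))[::-1][:100]
T_sel = T[:, top100]  # T`: 38 x 100
S_sel = S[:, top100]  # S`: 34 x 100, same columns as chosen on T
```

Note that S (and its labels) never enter the ranking; the columns kept in S are exactly those chosen on T.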

But I am not sure how to do cross-validation in this procedure. Please help.

Sooraj
  • Your question seems to be similar to this one: https://stats.stackexchange.com/questions/27750 . It seems that feature selection needs to follow approach 2, and it needs to be repeated for each fold of cross-validation. See this answer for demo (MATLAB) code: https://stats.stackexchange.com/a/27751/156469 –  May 31 '17 at 10:52
  • 1
    This is not correlation based, but. Another approach might be to use lasso. Regularized regression is meant to handle the $ 'n

    – meh May 31 '17 at 13:49
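On the cross-validation question: the usual fix is to put the feature selector inside the cross-validation loop, so the top-100 ranking is recomputed on each fold's training portion only. A sketch with scikit-learn (the data here is a random stand-in for T; `SelectKBest` with `f_classif` ranks features by the ANOVA F statistic, which for two classes gives the same ordering as squared correlation with the class):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-ins for T (38 x 7130) and its binary labels.
rng = np.random.default_rng(0)
T = rng.normal(size=(38, 7130))
y_T = np.array([0, 1] * 19)

# Because the selector sits inside the pipeline, cross_val_score refits it
# on the training part of every fold -- held-out samples never influence
# which 100 features are chosen.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, T, y_T, cv=cv)
```

Selecting features on the full data (approach 1, or approach 2 applied before splitting folds) leaks information into the held-out folds and optimistically biases the estimated accuracy.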

1 Answer


Looks like you have too small a sample size and too many features. 34+38=72 samples is too small for training a classifier. Visual exploration of the data is the best option for your case.

Nik
  • 3
    Don't think visual exploration will be reliable either. Related to split-sample validation, the sample size is too small by a factor of about 200 for split-sample methods to be stable. – Frank Harrell May 31 '17 at 13:58