-1

I am working with a dataset that has 41 features and a class label. To improve the runtime, I decided to apply dimensionality reduction as well, and I am also using cross-validation. I applied PCA to reduce the features at the very beginning and then fed the reduced features into the classifier and the cross-validation. Is this the right way to do it? Any input on this will be appreciated.

S. P
  • 105
  • 5

2 Answers

1

No, this is not the correct way to set up your cross-validation. By using all the data in the feature selection step, you are "cheating" in a way, since you are picking features that are relevant to the population as a whole. You will end up with an over-optimistic classifier, and your results will not be reflective of what you would expect to find by applying the same features and classifier to a new set of data.

To do this correctly, first split your data into your train/test folds. Do feature selection for each training set, and build a classifier on that training set. Only then are you allowed to use the test data to evaluate the classifier. This does complicate matters a little, since you'll wind up with 10 different feature sets and 10 different classifiers (for 10-fold cross-validation), but it's the only unbiased way to do it.
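
For example, a minimal sketch of this per-fold setup in Python with scikit-learn (X and y are assumed placeholders for the 41-feature matrix and the class labels; the classifier and the number of components are arbitrary choices, not anything prescribed in the answer):

    # A minimal sketch, assuming scikit-learn: wrapping PCA and the classifier
    # in a Pipeline means the reduction is re-fitted on the training part of
    # each fold only, so no information from the held-out fold leaks in.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("pca", PCA(n_components=10)),               # placeholder component count
        ("clf", LogisticRegression(max_iter=1000)),  # any classifier would do here
    ])

    # 10-fold CV: PCA and the classifier are refitted inside every fold.
    scores = cross_val_score(pipe, X, y, cv=10)
    print(scores.mean(), scores.std())

The Pipeline also keeps the per-fold bookkeeping (10 feature sets, 10 classifiers) out of your own code.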

Nuclear Hoagie
  • 5,553
  • 16
  • 24
  • I thought the question was about dimension reduction? – SmallChess Mar 15 '17 at 14:30
  • @StudentT, I think this answer is using "feature selection" and "dimensionality reduction" interchangeably. I'm not saying this is the best way to put his ideas into words, but the procedure he described is correct. Maybe it'd be better if he edited his post a bit. – darXider Mar 15 '17 at 14:41
  • @darXider I read "dimension reduction" as meaning PCA. – SmallChess Mar 15 '17 at 14:42
  • 2
    @StudentT, I understand; what this answer is saying, if I understand it correctly, is this: for each fold of the CV, apply PCA to your training data; pick, say, the first 5 PCs and discard the rest; train your model for this fold using only these 5 PCs as features; project your test data onto the same low-dimensional space used in training; use the projected test data to estimate performance for this fold. – darXider Mar 15 '17 at 14:49
  • @darXider No. OP wrote "PCA at the very beginning ... and then Cross-Validation ...". He was using the factors as features. – SmallChess Mar 15 '17 at 14:50
  • I agree, and he is asking if that is the correct way to do it (see the last two sentences of the OP). The answer is no, as Matt explained. I think both the question and the answer need a bit of editing. – darXider Mar 15 '17 at 14:53
  • @Matt Yes, feature selection needs to be done in each fold; I get it. Do you have any resource where I can learn about its implementation (something along the lines of the sketch below)? – S. P Mar 15 '17 at 15:05
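
One possible implementation of the per-fold procedure discussed in these comments, written out by hand rather than with a Pipeline, might look like the following (a rough sketch only; X and y are assumed to be NumPy arrays holding the 41 features and the labels, and the component count is a placeholder):

    # A rough sketch of doing PCA inside each fold explicitly: the projection
    # is learned on the training fold and merely applied to the test fold.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        pca = PCA(n_components=10)                 # placeholder component count
        X_tr = pca.fit_transform(X[train_idx])     # fit PCA on the training fold only
        X_te = pca.transform(X[test_idx])          # reuse the same projection on the test fold
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y[train_idx])
        scores.append(clf.score(X_te, y[test_idx]))

    print(sum(scores) / len(scores))
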
1

The procedure you suggest is correct, as PCA does not "see" the class labels. It would be wrong to select features before cross-validation if you were using a method that relies on any information about the class labels, because in that case you would be 'leaking' information from your validation set into your training set. With PCA, however, you are not doing that.

All of this is nicely explained in Section 7.10.2 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
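
For concreteness, a minimal sketch of the setup this answer describes, assuming scikit-learn (X and y are placeholders for the feature matrix and labels; the component count and classifier are arbitrary): PCA is fitted once on the features without ever seeing y, and only the classifier is then cross-validated on the reduced data.

    # A minimal sketch of fitting the (unsupervised) PCA step up front and
    # cross-validating the classifier on the reduced features afterwards.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X_reduced = PCA(n_components=10).fit_transform(X)   # labels are never used here

    scores = cross_val_score(LogisticRegression(max_iter=1000), X_reduced, y, cv=10)
    print(scores.mean())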

msh
  • 26
  • 5
  • 1
    Please also refer to http://stats.stackexchange.com/questions/55718/pca-and-the-train-test-split and http://stats.stackexchange.com/questions/114560/pca-on-train-and-test-datasets-do-i-need-to-merge-them. – darXider Mar 15 '17 at 15:02
  • Thanks. I get that I need to do PCA in each fold, and that is where I am stuck. Any light on how to do it would be of great help. – S. P Mar 15 '17 at 15:35
  • @S.P, if you are using Python, see this post: http://stats.stackexchange.com/questions/144439/applying-pca-to-test-data-for-classification-purposes – darXider Mar 15 '17 at 16:04
  • @darXider If I want to go further and test my model on a separate hold-out set, should I apply PCA there as well? – S. P Mar 16 '17 at 09:00
  • @S.P, yes. You should follow the same steps for the final test on the hold-out set that you followed during your CV. That is, after you have selected the best hyperparameters through cross-validation, you fit the model to the entire training set (where your model is now a pipeline of PCA + classifier) and then apply the trained model (that is, the trained pipeline of PCA + classifier) to the hold-out set, roughly as in the sketch below. This is explained in the SE link in my previous comment above. – darXider Mar 16 '17 at 13:22
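
A minimal sketch of the hold-out workflow described in the last comment, assuming scikit-learn (the variable names, parameter grid, and 80/20 split are placeholders, not anything specified in the thread):

    # Hold-out workflow: tune the PCA + classifier pipeline by CV on the
    # training set, refit it on the whole training set, then score the
    # untouched hold-out set exactly once.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline

    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
    param_grid = {"pca__n_components": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}

    search = GridSearchCV(pipe, param_grid, cv=10)  # CV runs on the training set only
    search.fit(X_train, y_train)                    # best pipeline is refitted on all of X_train

    print(search.best_params_, search.score(X_holdout, y_holdout))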