
I have a high-dimensional dataset ($n \times p$: $30 \times 100$) that I want to use as a training dataset to build a two-group classifier (LDA or QDA). I've read that you can use PCA for dimensionality reduction of your dataset, to select the most important features. But I'm a bit confused about what exactly you use as the input to build the classifier. I'm familiar with PCA via the SVD and what it means.

Consider the following situation:

  • I do an SVD of my dataset.
  • I look at the scores of the first few principal components.
  • When I label my scores by the group they come from, I see that the 3rd PC best separates my two groups (although it only explains 7% of the total variance).
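The steps above can be sketched with plain numpy. The data, labels, and the group-separation check below are all made up for illustration; only the shapes ($30 \times 100$, two groups) come from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data in the question's shape: n=30 samples, p=100 features,
# with two assumed groups of 15 samples each.
X = rng.normal(size=(30, 100))
y = np.array([0] * 15 + [1] * 15)

# Center before the SVD -- PCA assumes centered data.
mean = X.mean(axis=0)
Xc = X - mean

# Thin SVD: Xc = U @ diag(s) @ Vt.  The rows of Vt are the loadings,
# and U * s are the PC scores of the training samples.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # shape (30, 30): one column per PC

# Fraction of total variance explained by each PC (cf. the ~7% above).
var_explained = s**2 / np.sum(s**2)

# Inspect how well each PC separates the two groups, e.g. via the gap
# between the group means of the scores on that PC.
for j in range(3):
    gap = abs(scores[y == 0, j].mean() - scores[y == 1, j].mean())
    print(f"PC{j+1}: {var_explained[j]:.1%} variance, group-mean gap {gap:.2f}")
```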

What do I do next?

  1. I take the 3rd PC, transform it back to the original parameter space (scores * loadings * scale + mean), and build my classifier
  2. I look at the loadings of the 3rd PC, try to decide which parameters in my original parameter space are important, and build a classifier using only these.
  3. ...

Option 2 seems the most sensible to me, but I'm not entirely sure. Also, if I see that only the 3rd PC is important for explaining the variance between my two groups, can I forget about the first two PCs in my further analysis?

amoeba
statastic
  • It is unclear whether you want just dimensionality reduction (given that you have n

    – ttnphns May 14 '14 at 20:12
  • If your wish is the first one I said about, i.e. only the fact that n

    – ttnphns May 14 '14 at 20:33
  • See also: [Best practice for dimensionality reduction with PCA and LDA: does it make sense to combine them?](http://stats.stackexchange.com/questions/106121) – amoeba Dec 22 '14 at 15:41

1 Answer


If you're going to do LDA after PCA, I would keep the first k components. Don't try to figure out which of the components are important, and don't go back to the original parameter space. You can feed the k-dimensional data into your LDA classifier and let it figure out what is important there.
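A minimal numpy sketch of this pipeline: keep the first k PC scores and train a two-class LDA on them. The helper names (`pca_fit`, `lda_fit`, `predict`), the value k = 5, and the synthetic data are all assumptions for illustration, not anything from the answer:

```python
import numpy as np

def pca_fit(X, k):
    """Center X and return the training mean and the first k loading vectors."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T            # loadings: shape (p, k)

def lda_fit(Z, y):
    """Two-class LDA on the k-dimensional scores Z: Fisher direction + threshold."""
    m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    # Pooled within-class covariance, slightly regularized for stability.
    Sw = np.cov(Z[y == 0], rowvar=False) + np.cov(Z[y == 1], rowvar=False)
    Sw += 1e-6 * np.eye(Z.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)     # Fisher discriminant direction
    c = w @ (m0 + m1) / 2                # midpoint decision threshold
    return w, c

def predict(X_new, mean, loadings, w, c):
    """Project new data with the *training* mean/loadings, then threshold."""
    Z = (X_new - mean) @ loadings
    return (Z @ w > c).astype(int)

# Hypothetical example in the question's 30 x 100 setting, with k = 5.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 100))
y = np.array([0] * 15 + [1] * 15)
X[y == 1] += 1.0                          # inject a group difference
mean, loadings = pca_fit(X, k=5)
w, c = lda_fit((X - mean) @ loadings, y)
print(predict(X, mean, loadings, w, c))   # training-set predictions
```

The point of the design is that LDA itself weights the k score dimensions, so there is no need to hand-pick "the important PC" beforehand.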

Aaron
  • You mean use the principal components to build your classifier? How is this possible? How can your classifier know to look at the correct parameters when you give it test data (which has all p parameters) without the information in your loadings? Or do you somehow have to transform your test data and your future data? – statastic May 14 '14 at 19:23
  • You always do the exact same transformations to your test data as you did to your training data. – Aaron May 14 '14 at 19:57
  • Ok, I think I got it. So the following should be the right thing to do with the test data afterwards: (test_data - mean_training) * scale_training * (first q columns of loadings_train) – statastic May 15 '14 at 08:08
  • Looks good to me. – Aaron May 15 '14 at 15:26
  • If you want to feed only scores into the LDA training that actually have discriminative power, consider using e.g. PLS instead of PCA for the dimensionality reduction step. – cbeleites unhappy with SX May 18 '14 at 10:13
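The transformation discussed in the comments above can be sketched as follows. All variable names (`mean_train`, `scale_train`, `loadings_train`, q = 5) and the data are assumptions; note that whether you divide by `scale_train` at all depends on whether you standardized the training data before the SVD, and that every statistic here is estimated on the training data only:

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(size=(30, 100))
X_test = rng.normal(size=(5, 100))

# Statistics estimated on the TRAINING data only.
mean_train = X_train.mean(axis=0)
scale_train = X_train.std(axis=0, ddof=1)   # only if you scaled before PCA
U, s, Vt = np.linalg.svd((X_train - mean_train) / scale_train,
                         full_matrices=False)
q = 5
loadings_train = Vt[:q].T                   # first q loadings, shape (100, q)

# The comment's recipe: center and scale the test data with the training
# statistics, then project onto the first q training loadings.
Z_test = ((X_test - mean_train) / scale_train) @ loadings_train
print(Z_test.shape)   # scores for the test rows, shape (n_test, q)
```

These q-dimensional test scores are what you would then pass to the LDA trained on the corresponding training scores.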