
I have a dataset that includes an X_train table, an X_test table, a Y_train table, and a Y_test table for prediction. The feature table (X) has around 800 columns, and the label table (Y) has around 200 columns. I read in this paper that it might be possible to reduce the dimensionality of both the X and Y tables. I used the following code to reduce the size of the X table:

  from sklearn.decomposition import PCA

  # Keep enough components to explain 80% of the variance in X
  pca = PCA(n_components=0.8, random_state=42)
  pca.fit(X_train_normalized)

  # Apply the projection fitted on the training data to both tables
  X = pca.transform(X_train_normalized)
  X_test = pca.transform(X_test_normalized)

My question is: how can I apply dimensionality reduction to the Y table as well? Should I concatenate the X_train and Y_train tables before running PCA and then split them afterwards, or is there a better way to do it? If my approach is not correct, please kindly give your comments as well. Thank you!
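
A minimal sketch of the separate-PCA approach suggested in the comments below, assuming the Y tables are normalized the same way as the X tables (the names Y_train_normalized and Y_test_normalized are illustrative, not from the original code):

  from sklearn.decomposition import PCA

  # One PCA instance per table, so neither fitted projection overwrites the other
  pca_x = PCA(n_components=0.8, random_state=42)
  pca_y = PCA(n_components=0.8, random_state=42)

  # Reduce the feature table (X): fit on train, reuse the projection on test
  X_train_reduced = pca_x.fit_transform(X_train_normalized)
  X_test_reduced = pca_x.transform(X_test_normalized)

  # Reduce the label table (Y) with its own, independently fitted PCA
  Y_train_reduced = pca_y.fit_transform(Y_train_normalized)
  Y_test_reduced = pca_y.transform(Y_test_normalized)

  # Each table keeps its own component count, e.g. (n, 108) and (n, 77)
  print(X_train_reduced.shape, Y_train_reduced.shape)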

  • $X_{train}$ and $Y_{train}$ shouldn't be concatenated, because doing so would mix the input space ($X$) and the output space ($Y$), and we would no longer be able to tell which columns are the labels that have to be learned. So reducing $X$ and $Y$ separately is the correct way to go. – Javier TG Sep 15 '20 at 13:38
  • @JavierTG, thanks for your reply. Actually I tried that, but I guess I did it wrong: PCA brought X down to 108 components and Y down to 77, but then the reduced X also ended up with 77 components, even though I used separate objects like pca1 and pca2. Would you mind showing me how to implement it, e.g. with pseudocode? – almo Sep 15 '20 at 13:47
  • This example may help: suppose $X_{train}$ has one feature (1 column) and $Y_{train}$ has one feature (1 column); then reducing the features of $X_{train}$ concatenated with $Y_{train}$ to only 1 column wouldn't help, would it? – Javier TG Sep 15 '20 at 13:55
  • @JavierTG :) I understand what you meant about why they should not be put together. I just don't know what to do with the Y table. I assigned the results to pca1 and pca2 so that neither PCA would be overwritten, but after applying PCA to the X table and then to the Y table, both X and Y had the same number of components. With PCA(0.8) I expected 108 components for X and 77 for Y. How would you do it? – almo Sep 15 '20 at 13:59
  • I'm sorry that I don't know how to do it with Python, but in MATLAB/Octave I would do: $[U, \sim, \sim] = \text{svd}(X^T X)$ to calculate the principal directions, then $X = XU$ to calculate the principal components, and finally $X = X(:,\, 1{:}n_{columns})$, where $n_{columns}$ is the number of dimensions to which the data is reduced. And I would do the same to $Y$ (a Python version of this is sketched after the comments). – Javier TG Sep 15 '20 at 16:46
  • The justifications (and much more theory) related to my comment can be found here: [Theory](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca), [why use $\text{svd}(X^TX)$](https://stats.stackexchange.com/questions/314046/why-does-andrew-ng-prefer-to-use-svd-and-not-eig-of-covariance-matrix-to-do-pca/314062#314062) – Javier TG Sep 15 '20 at 18:58
  • @JavierTG Thanks! – almo Sep 15 '20 at 19:19
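
A rough Python translation of the MATLAB/Octave recipe from the comments above (not from the original thread): it assumes the input arrays are already centered (zero column means), as PCA requires, and the component counts 108 and 77 are the illustrative numbers from the discussion:

  import numpy as np

  def reduce_with_svd(data, n_columns):
      # Principal directions: singular vectors of data' * data (the scatter matrix)
      U, _, _ = np.linalg.svd(data.T @ data)
      # Principal components: project the (centered) data onto those directions,
      # keeping only the first n_columns of them
      return data @ U[:, :n_columns], U

  # Reduce X and Y independently, reusing each U to project the test tables
  X_train_reduced, U_x = reduce_with_svd(X_train_normalized, 108)
  X_test_reduced = X_test_normalized @ U_x[:, :108]

  Y_train_reduced, U_y = reduce_with_svd(Y_train_normalized, 77)
  Y_test_reduced = Y_test_normalized @ U_y[:, :77]

This matches PCA with a fixed number of components; PCA(0.8) instead chooses the count automatically from the explained variance.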
