Using PCA to reduce features, then training a random forest on these features. How do I test my model if I can't do PCA on the test set?

Question

I have extracted image features from medical images using a convolutional neural network, and I am combining them with clinical features such as age and gender. There are 2048 extracted features, and I am using PCA to reduce them to 7 components. These 7 components are combined with the clinical data to form the dataset used to train a random forest classifier.

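The training data set X undergoes PCA:

import numpy as np
mu = np.mean(X, 0)  # training mean, kept so it can be reused later
X = X - mu  # center the data (broadcasting replaces np.tile)
(l, M) = np.linalg.eigh(np.dot(X.T, X))  # eigh: symmetric input, real eigenvalues in ascending order
W = M[:, np.argsort(l)[::-1][:7]]  # eigenvectors of the 7 largest eigenvalues
X = np.dot(X, W)  # 7 principal component scores per sample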

Now X is used as the dataset to train the random forest. However, from what I have read, I cannot simply repeat this process with my testing dataset Y. I am not sure why not, and I am not sure how to test my model if I can't reduce the extracted image features of the testing dataset to 7.

You can't do PCA separately on the test data because it would produce a different basis; even the sign of each component is arbitrary. https://stats.stackexchange.com/questions/269428/reverse-the-sign-of-pca Instead, use the same basis to project the test data. A library like `sklearn` gives the process a simple interface by giving the PCA class distinct `fit` and `transform` methods. — Sycorax, Mar 23 '20 at 21:05
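A minimal sketch of the `sklearn` workflow this comment points to, not the asker's actual pipeline. The arrays are synthetic stand-ins (64 image features instead of 2048, 2 clinical columns, random labels), and every variable name here is hypothetical. The key point is that `fit_transform` is called on the training image features only, while the test image features go through `transform` with the already-fitted object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): image features, clinical features, labels.
img_train = rng.normal(size=(200, 64))
img_test = rng.normal(size=(50, 64))
clin_train = rng.normal(size=(200, 2))
clin_test = rng.normal(size=(50, 2))
y_train = rng.integers(0, 2, size=200)

pca = PCA(n_components=7)
pcs_train = pca.fit_transform(img_train)  # learn mean and basis from training data only
pcs_test = pca.transform(img_test)        # reuse that mean and basis for the test data

# Combine the 7 components with the clinical columns, as in the question.
X_train = np.hstack([pcs_train, clin_train])
X_test = np.hstack([pcs_test, clin_test])

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```

With real data, `pred` would then be compared against the held-out test labels to evaluate the model.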
Yes, you could use the principal component score coefficients, which can be used to calculate PCs for new test objects. — , Mar 23 '20 at 21:44
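The same idea can be sketched in plain NumPy, assuming the "score coefficients" are the eigenvectors found on the training set. The helper names `fit_pca` and `project` are hypothetical, and the data is synthetic. The test set is centered with the training mean and multiplied by the training eigenvectors; no eigendecomposition is run on the test data.

```python
import numpy as np

def fit_pca(X_train, k):
    """Return the training mean and the top-k eigenvectors of the scatter matrix."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(evals)[::-1][:k]  # indices of the k largest eigenvalues
    return mu, evecs[:, order]

def project(X, mu, W):
    """Center with the training mean, then project onto the training basis."""
    return (X - mu) @ W

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))  # synthetic stand-in for training features
Y_test = rng.normal(size=(30, 20))    # synthetic stand-in for test features

mu, W = fit_pca(X_train, k=7)
Z_train = project(X_train, mu, W)  # features used to train the classifier
Z_test = project(Y_test, mu, W)    # features used to test it
```

Because both sets are expressed in the same basis, column j of `Z_test` means the same thing as column j of `Z_train`, which is what the trained classifier expects.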
