
I tried to search for an answer about the evaluation and interpretation of a pattern learned with PCA, on data from a Testing Set, but I found no answer. Let me explain my situation:

My Data Mining task is to implement an instance of the KDD process in order to learn an accurate classifier of Windows applications.

I have a single "dataset.csv" file on which I do the initial split into Training Set and Testing Set (the entire learning & validation phases will focus exclusively on the Training Set).

My choice is to learn a classifier with the Gaussian Naive Bayes algorithm, so I decide to transform my Training Set with PCA in order to encourage independence between the attributes (for example, I transform my Training Set of 50 features into a new Training Set of 10 Principal Components, learn my pattern on it with Gaussian Naive Bayes, and identify the best configuration with GridSearchCV).

Once my pattern is learned, I have to perform the final evaluation on the whole Testing Set that was split off at the start.

Here's my problem: the pattern was learned on a Training Set of 10 columns (i.e. 10 Principal Components), but my Testing Set still has the original 50 features, so the number of columns no longer matches the learned pattern and I cannot make predictions. What should I do?

EDIT

To be clear, I will specify the finer details of my problem so that you can give me as complete an answer as possible.

I will describe very quickly what I did:

  1. I split the starting dataset into Training Set and Testing Set using sklearn.model_selection.train_test_split with test_size=0.3 (i.e. 30%).
  2. After splitting, I performed Data Cleaning on the Training Set (replacing any missing values with the mode of each column) and Data Scaling using sklearn.preprocessing.MinMaxScaler.
  3. After this small amount of preprocessing, I performed PCA on all of the Training Set's independent variables, obtaining a new Training Set composed of 10 columns (i.e. the 10 Principal Components). I then used sklearn.model_selection.GridSearchCV to learn the best configuration of sklearn.naive_bayes.GaussianNB (passed as the estimator parameter), over the different var_smoothing values that I put in a list and pass to the param_grid parameter, with cv=5 for the K-Fold Cross Validation used by GridSearchCV, and finally called fit on the entire Training Set. (A rough code sketch of these steps follows this list.)
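
Here is a minimal sketch of the workflow just described, assuming the target column in dataset.csv is named "label" (a placeholder); the answer below explains why steps 2–3 should instead be folded into a single pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# 1. Split the starting dataset into Training Set (70%) and Testing Set (30%)
data = pd.read_csv("dataset.csv")
X = data.drop(columns="label")   # "label" is a placeholder target column name
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Data Cleaning (mode imputation) and Data Scaling, fitted on the Training Set
X_train = X_train.fillna(X_train.mode().iloc[0])
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# 3. PCA down to 10 components, then GridSearchCV over var_smoothing for GaussianNB
pca = PCA(n_components=10).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)

param_grid = {"var_smoothing": np.logspace(0, -9, num=100)}
grid = GridSearchCV(GaussianNB(), param_grid=param_grid, scoring="accuracy", cv=5)
grid.fit(X_train_pca, y_train)
```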

This is how I learn my model, which I now have to test on the Testing Set.

I would like a little more clarity on what should be done from this point on. My doubt is: would it be correct to perform PCA separately on the Testing Set, computing the same number of Principal Components, and then evaluate?

DavidZoy
  • One would apply PCA to the whole dataset before splitting into test/train. – msuzen Jun 29 '21 at 14:33
  • That's totally wrong: you should train your model only on the training data, without using any information from the testing data. If you apply PCA on the whole dataset (including the test data) before training the model (so, before the split), then you in fact use some information from the test data. Thus, you cannot really judge the behaviour of your model on the test data, because it is no longer unseen data. – DavidZoy Jun 29 '21 at 15:34
  • Not really. PCA is "unsupervised". There is no peeking here. The test set is still unseen data; instances in the test set are not used in building the model. – msuzen Jun 29 '21 at 18:50
  • @MehmetSuzen it causes data leakage, a somewhat related discussion: https://stats.stackexchange.com/questions/55718/pca-and-the-train-test-split – gunes Jun 29 '21 at 19:02
  • That's right @gunes, so it should not be wrong for me to perform PCA on the Testing Set during the Validation step of the KDD, should it? PCA on the Testing Set would go alongside the basic practices, such as Data Cleaning and Data Scaling etc., because in order to validate the pattern learned from the Training Set through PCA, I have to bring the Testing Set into the same representation by performing PCA with the same number of components (used for the Training Set); correct me if I'm wrong. – DavidZoy Jun 29 '21 at 19:14
  • @gunes Thank you for the link. I can see the general practice in the community, but I am not convinced that this would create any "target leakage". PCA does not interact with the labels. Under the strict definition of "target leakage" there is no leakage if PCA is built on the entire set, as long as there is no target leakage inherently (see https://dl.acm.org/doi/10.1145/2382577.2382579). As for DavidZoy's question: if PCA is built on the training set, the test set can be projected separately, see https://stats.stackexchange.com/questions/405660/pca-in-production-use-with-incoming-data – msuzen Jun 29 '21 at 19:21
  • @MehmetSuzen That practice is a poor one. The test set mimics data that do not yet exist. Remember that Siri is supposed to be able to do speech recognition on people who have not yet been born. – Dave Jun 29 '21 at 19:37
  • @MehmetSuzen thanks for the article link. I couldn't exactly find parts supporting your idea (I just skimmed). It's not only the SO community; for example, the data leakage section in the sklearn documentation describes exactly this kind of scenario: https://scikit-learn.org/stable/common_pitfalls.html . Simply put, introducing information about the test set may introduce trends in the features that are not readily available in the training data. Thus, target or not, bringing in the test set always carries a danger of leakage, even though one might get away with it on occasion. – gunes Jun 29 '21 at 19:39
  • @gunes and Dave: Many thanks for the interesting discussion. I have already expressed my view, so I won't comment further, but there is another relevant discussion of a similar nature to ours: https://stats.stackexchange.com/questions/239898/is-it-actually-fine-to-perform-unsupervised-feature-selection-before-cross-valid – msuzen Jun 29 '21 at 20:21

1 Answer


Just a minor correction: after PCA, you use the projections onto the principal components as features, not the PCs themselves. But you'll have a reduced set of features, as you mentioned, say 10.

You'll set up a pipeline (e.g. you can utilize the Pipeline object in scikit-learn, which, as I understand from your notation, is the library you're using) with PCA and GaussianNB as its steps, and use grid search for hyper-parameter optimization (HPO).
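
A minimal sketch of such a pipeline, using synthetic placeholder data in place of the 50-feature dataset (the n_components grid values below are just examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the 50-feature dataset in the question
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Every preprocessing step lives inside the pipeline, so each CV fold
# refits the imputer, the scaler and the PCA on its own training portion only.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    ("scale", MinMaxScaler()),
    ("pca", PCA()),
    ("gnb", GaussianNB()),
])

param_grid = {
    "pca__n_components": [5, 10, 20],                  # example values
    "gnb__var_smoothing": np.logspace(0, -9, num=100),
}

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)  # X_train still has all 50 original features
```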

This is different from your proposed solution. In your second and third steps, you also introduce some leakage into the validation folds because you performed the data scaling and PCA beforehand. As I mentioned above, you should think of all the operations you performed as a single model/pipeline and apply CV to it. This is harder to implement in code if you don't use pipelines, but it's the right thing to do.

Finally, with the best HPs selected, the final model (pipeline) will be fitted on the training set. This fitted model can predict the test set as well, because the pipeline contains a PCA step with the PCs found on the training set, so there will be no dimension mismatch issue.

To reiterate, you won't fit PCA or the scaler on the test set; you'll take the objects fitted on the training set and apply them to the test set.
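
Continuing the sketch above, the refitted best pipeline handles the raw 50-feature test data end to end:

```python
from sklearn.metrics import accuracy_score

# `grid.best_estimator_` is the whole pipeline refitted on the full training set
# with the best hyper-parameters; it imputes, scales and projects the raw test
# features with the training-set fits before GaussianNB makes its predictions.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```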

gunes
  • I didn't know about the Pipeline class in sklearn, so I hadn't used it. To be clear, I will add a few more details to the question, so that you can give me a more precise answer. – DavidZoy Jun 29 '21 at 20:49
  • @DavidZoy You don't have to use it, it's just very convenient for this kind of situation. I've amended my answer. – gunes Jun 29 '21 at 21:25
  • Now it's clearer, thank you so much! I will try to study your solution to understand how to adapt with the Pipeline. – DavidZoy Jun 29 '21 at 21:30
  • I have one last question to ask: I tried to use the `Pipeline` and integrate it with `GridSearchCV` to perform the `PCA`, but it doesn't seem to perform the data transformation as described in the library. Maybe I'm doing something wrong, but these are the steps I take: `pipe = Pipeline(steps=[('pca', PCA()), ('GaussianNB', GaussianNB())])`, then `grid_values = {'GaussianNB__var_smoothing': np.logspace(0, -9, num=100), 'pca__n_components': components}` and `grid = GridSearchCV(pipe, param_grid=grid_values, scoring='accuracy', cv=folds)`, and finally `grid.fit(X_Train, Y_Train)` – DavidZoy Jul 01 '21 at 17:42
  • The only doubt I have is that the transform seems to happen only in the test phase; but then how does it learn the model from the transformed data if the transform is not performed during the training phase? Because, after the `fit`, I do `print(pipe.steps[0][1].explained_variance_)` but it returns the error: `AttributeError: 'PCA' object has no attribute 'explained_variance_'` – DavidZoy Jul 01 '21 at 17:46
  • Use `grid.best_estimator_` for the fitted model. – gunes Jul 01 '21 at 18:52
  • I saw, but there is no way to observe the attributes of the PCA used during the training phase, right? Like `explained_variance_` or `explained_variance_ratio_`, etc. – DavidZoy Jul 01 '21 at 19:44
  • You should have access to them in the `best_estimator_` object. – gunes Jul 01 '21 at 19:58
  • I found them, thank you again! – DavidZoy Jul 01 '21 at 20:06
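
For completeness, a short sketch of what the last few comments arrive at, assuming the fitted `grid` object from the pipeline sketch in the answer: the `PCA()` passed to `Pipeline` is only a template, and the fitted copy lives inside the refitted best estimator.

```python
# Access the PCA step that was actually fitted on the training set
fitted_pca = grid.best_estimator_.named_steps["pca"]
print(fitted_pca.explained_variance_)        # variance captured by each component
print(fitted_pca.explained_variance_ratio_)  # same, as a fraction of total variance
```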