
Originally, without SMOTE, my machine learning steps go like this (a rough code sketch follows the list):

  1. Feature vectorization
  2. split data into X_train, X_test, y_train, and y_test
  3. use X_train and y_train for machine learning
  4. predict/test on X_test and y_test
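In code, the baseline looks roughly like this. This is only a sketch: `make_classification` stands in for whatever my feature vectorization actually produces, and `LogisticRegression` is just a placeholder classifier.

```python
# Baseline pipeline without SMOTE (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Step 1: vectorized features X and labels y (imbalanced ~90/10 here)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Steps 3-4: fit on the training data, evaluate on the test data
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```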

I think there are two spots where I could inject my SMOTE code. One option is to apply it before the train/test split, so that oversampling of the minority class takes place in both the training and the testing data. Like so (rough sketch after the list):

  1. Feature vectorization
  2. SMOTE oversampling
  3. split data into X_train, X_test, y_train, and y_test
  4. use X_train and y_train for machine learning
  5. predict/test on X_test and y_test
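A minimal sketch of this first option, assuming imbalanced-learn's `SMOTE` (the dataset and classifier are the same placeholders as above):

```python
# Option 1: SMOTE before the split (sketch)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: oversample the whole dataset, so synthetic minority samples
# end up in both the training and the testing portions
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Steps 3-5: split the resampled data, then train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```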

I got very good results using the above steps, but I wonder whether SMOTE should instead be applied only to the training data, with testing done on the original test set, since the latter reflects the real-world distribution of majority- and minority-class samples. Like so (rough sketch after the list):

  1. Feature vectorization
  2. split data into X_train, X_test, y_train, and y_test
  3. SMOTE done only on X_train and y_train
  4. use X_train_SMOTE and y_train_SMOTE for machine learning
  5. predict/test on X_test and y_test
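And a sketch of this second option, again assuming imbalanced-learn's `SMOTE` and the same placeholder dataset and classifier:

```python
# Option 2: split first, then SMOTE only the training fold (sketch)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: split before any resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Step 3: oversample only the training data; the test set keeps the
# original class distribution
X_train_SMOTE, y_train_SMOTE = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Steps 4-5: train on the resampled data, test on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(X_train_SMOTE, y_train_SMOTE)
print(classification_report(y_test, clf.predict(X_test)))
```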

Which is the better implementation of SMOTE?

    Third method, SMOTE the training set, otherwise your testing sample is then not the "real" data. Confirmed [here](https://stats.stackexchange.com/a/111552/191128) – R. Prost Jan 15 '18 at 11:17
