
Originally, without SMOTE, my machine learning steps go like this (a rough code sketch follows the list):

  1. Feature vectorization
  2. split data into X_train, X_test, y_train, and y_test
  3. use X_train and y_train for machine learning
  4. predict/test on X_test and y_test
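In code, the baseline looks roughly like this. This is only a sketch: `make_classification` stands in for whatever my feature vectorization actually produces, and `LogisticRegression` is just a placeholder classifier.

```python
# Baseline pipeline without SMOTE (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Step 1: vectorized features X and labels y (imbalanced ~90/10 here)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Steps 3-4: fit on the training data, evaluate on the test data
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```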

I think there are two spots where I could inject my SMOTE code. One option is to apply it before the train/test split, so that oversampling of the minority class takes place in both the training and the testing data. Like so (rough sketch after the list):

  1. Feature vectorization
  2. SMOTE oversampling
  3. split data into X_train, X_test, y_train, and y_test
  4. use X_train and y_train for machine learning
  5. predict/test on X_test and y_test
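A minimal sketch of this first option, assuming imbalanced-learn's `SMOTE` (the dataset and classifier are the same placeholders as above):

```python
# Option 1: SMOTE before the split (sketch)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: oversample the whole dataset, so synthetic minority samples
# end up in both the training and the testing portions
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Steps 3-5: split the resampled data, then train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.25, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```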

I got very good results using the above steps, but I wonder whether SMOTE should instead be applied only to the training data, with testing done on the original test set, since the latter reflects the real-world distribution of majority- and minority-class samples. Like so (rough sketch after the list):

  1. Feature vectorization
  2. split data into X_train, X_test, y_train, and y_test
  3. SMOTE done only on X_train and y_train
  4. use X_train_SMOTE and y_train_SMOTE for machine learning
  5. predict/test on X_test and y_test
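And a sketch of this second option, again assuming imbalanced-learn's `SMOTE` and the same placeholder dataset and classifier:

```python
# Option 2: split first, then SMOTE only the training fold (sketch)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Step 2: split before any resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Step 3: oversample only the training data; the test set keeps the
# original class distribution
X_train_SMOTE, y_train_SMOTE = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Steps 4-5: train on the resampled data, test on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(X_train_SMOTE, y_train_SMOTE)
print(classification_report(y_test, clf.predict(X_test)))
```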

Which is the better implementation of SMOTE?

    Third method, SMOTE the training set, otherwise your testing sample is then not the "real" data. Confirmed [here](https://stats.stackexchange.com/a/111552/191128) – R. Prost Jan 15 '18 at 11:17
