If you are going to use SMOTE, it should only be applied to the training data. This is because you are using SMOTE to gain an improvement in operational performance, and both the validation and test sets are there to provide an estimate of operational performance. In the case of the validation set, this is so that we can choose the hyper-parameters that give the best operational performance; in the case of the test set, it is so that we have an unbiased estimate of how well the system will perform in operational use.
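If you use cross-validation for model selection, the safest way to keep SMOTE away from the evaluation folds is to put it inside a pipeline. Here is a minimal sketch, assuming the imbalanced-learn (imblearn) package is available; imblearn's Pipeline applies the sampler at fit time only, so each training fold is oversampled while the corresponding evaluation fold is left untouched:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced problem (90% majority / 10% minority).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# The sampler runs only when the pipeline is fitted, so within each
# cross-validation split SMOTE touches the training fold but never the
# fold used for evaluation.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```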
If you retrain on the amalgamated training, validation and test sets afterwards, you need to apply SMOTE to the amalgamated data in the same way that you did for the training set. However, if any of your hyper-parameters are sensitive to the size of the training set (and regularisation parameters will be), then you need to do the model selection again. So I would just amalgamate the training and test sets (and apply SMOTE to them) and perform the model selection again, using the un-SMOTEd validation set for tuning the hyper-parameters, so that you can still estimate operational performance.
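A minimal sketch of that procedure, using toy data in place of the original splits (the dataset, the logistic regression model and the grid of C values are all just placeholders for illustration):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
# X_fit plays the role of the amalgamated train + test data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample only the amalgamated fitting data; the validation set is untouched.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_fit, y_fit)

# Redo model selection: regularisation strength is sensitive to training-set size.
best_C, best_loss = None, np.inf
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_res, y_res)
    loss = log_loss(y_val, model.predict_proba(X_val))  # un-SMOTEd estimate
    if loss < best_loss:
        best_C, best_loss = C, loss
```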
SMOTE basically does two things. Firstly, it resamples the dataset to give greater prominence to the minority class. This is effectively saying that misclassifying a minority-class example as belonging to the majority class is a worse kind of error than misclassifying a majority-class example as belonging to the minority class; in other words, it is a form of cost-sensitive learning. Most modern classifiers (and a lot of old ones as well) can deal with unequal misclassification costs more directly, either by weighting the examples from each class differently in the cost function, or by using a probabilistic classifier and moving the threshold away from 0.5 to some lower probability (assuming the minority class is the "positive" class). That is likely to be rather more efficient, as it doesn't increase the size of the training set.
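For example, in scikit-learn both alternatives are one-liners (the class weights and the 0.1 threshold below are arbitrary values for illustration, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: reweight the classes in the cost function (no extra data needed).
weighted = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)

# Option 2: fit an unweighted probabilistic classifier and lower the decision
# threshold for the minority ("positive") class instead of resampling.
plain = LogisticRegression(max_iter=1000).fit(X, y)
p = plain.predict_proba(X)[:, 1]
y_pred = (p > 0.1).astype(int)  # threshold < 0.5 favours the minority class
```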
The other thing SMOTE does is to apply some regularisation. It "blurs" the training data by adding synthetic examples that conceal the exact locations of the training examples and make them harder to memorise (i.e. it mitigates overfitting, which can be a problem if you heavily weight a small number of minority-class examples). However, again, most modern classifier systems (and a lot of old ones) have built-in forms of regularisation that are likely to be better. The form of regularisation used in SMOTE is a bit odd, in that the interpolation implies linear structures in the data that are not part of the data-generating process; just adding noise to the training examples would have a similar effect, with a more even blurring.
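To make the difference concrete, here is a simplified sketch (real SMOTE interpolates towards one of the k nearest minority-class neighbours; a random minority neighbour is used here to keep it short):

```python
import numpy as np

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # toy minority-class examples

# SMOTE-style synthesis: interpolate between a point and another minority
# point, "blurring" the data along straight lines between real examples.
i, j = rng.integers(20, size=100), rng.integers(20, size=100)
lam = rng.uniform(size=(100, 1))
synth_smote = X_min[i] + lam * (X_min[j] - X_min[i])

# Noise-based alternative: jitter each example, giving a more even blurring
# that does not impose linear structure between training points.
k = rng.integers(20, size=100)
synth_noise = X_min[k] + rng.normal(scale=0.1, size=(100, 2))
```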
In short, if you are using a modern classifier system (or a good old one like regularised logistic regression), and using it well, then SMOTE is probably not going to help much (and may make things much worse - YMMV).
The real question is deciding what performance criterion is relevant for your application: rather than maximising accuracy, minimise the expected misclassification loss (taking into account the different false-positive and false-negative costs). If the misclassification costs are unequal, accuracy is not a good performance metric‡!
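As a sketch of what that means in practice (the 1:20 cost ratio is just an assumed example): if the classifier outputs calibrated probabilities, the threshold that minimises the expected loss follows directly from the costs, since predicting positive is optimal whenever p * c_fn > (1 - p) * c_fp:

```python
import numpy as np

c_fp, c_fn = 1.0, 20.0  # assumed costs: a false negative is 20x worse

# Bayes-optimal rule: predict positive when p exceeds c_fp / (c_fp + c_fn).
threshold = c_fp / (c_fp + c_fn)  # ~0.048 here, well below 0.5

def expected_loss(y_true, p_pos):
    """Average misclassification loss for cost-thresholded predictions."""
    y_pred = (p_pos > threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (c_fp * fp + c_fn * fn) / len(y_true)
```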
‡ For applications where the false-positive and false-negative costs are equal and minimising the expected loss is the goal, accuracy is a good criterion for performance evaluation (but perhaps not for model selection, which is not the same thing; see my answer here: Why is accuracy not the best measure for assessing classification models?).