Normalization/standardization: Should one do this before oversampling/undersampling the data or after?

Question

When working with imbalanced datasets, should one do one-hot encoding and data standardization before or after sampling techniques (such as oversampling or undersampling)?

score 12 · Accepted Answer · answered Aug 21 '18 at 22:21

It doesn't make much of a difference, but you should do most pre-processing steps (encoding, normalization/standardization, etc.) before under/over-sampling the data.

This is because many sampling techniques rely on a simple model fitted to the data (e.g. SMOTE uses k-nearest neighbors to generate synthetic samples, and the ClusterCentroids under-sampling technique uses k-means clustering). These models work better on pre-processed data: both k-NN and k-means use Euclidean distance, which is sensitive to feature scale, so without normalization the features with the largest ranges dominate the distance.
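To make the scale issue concrete, here is a minimal sketch using only the Python standard library. The five (income, age) rows are made-up values, and `standardize` and `nearest_neighbor` are hypothetical helpers written for this illustration, not part of any sampling library. It shows only the neighbor-search step that a k-NN based sampler depends on, not SMOTE itself.

```python
import math
import statistics

# Hypothetical minority-class rows: (income, age). Values are invented for illustration.
points = [
    (50000.0, 25.0),  # query point
    (50500.0, 60.0),
    (52000.0, 26.0),
    (90000.0, 45.0),
    (30000.0, 35.0),
]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_neighbor(rows, query_index=0):
    """Return the index of the row closest to rows[query_index] in Euclidean distance."""
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(query, rows[i]))

raw_neighbor = nearest_neighbor(points)
scaled_neighbor = nearest_neighbor(standardize(points))
print(raw_neighbor, scaled_neighbor)  # prints: 1 2
```

On the raw data the nearest neighbor of the first row is the 60-year-old with a similar income, because income differences swamp age differences. After standardization it is the 26-year-old, so a sampler that interpolates between neighbors would generate different synthetic points depending on when you scale.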

So, for the sampling techniques to work best, perform whatever pre-processing steps you can beforehand. That said, if you use a random over/under-sampler, I don't think it makes much of a difference.
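A rough way to see why the random case is insensitive to ordering: a random sampler picks rows by index and never looks at feature values. The sketch below uses only the standard library; the data and the `random_oversample_indices` helper are hypothetical, and it assumes the scaling parameters are computed once on the original data and reused in both orderings.

```python
import random
import statistics

# Hypothetical one-feature dataset; label 1 is the minority class.
values = [10.0, 12.0, 11.0, 13.0, 9.0, 30.0, 34.0]
labels = [0, 0, 0, 0, 0, 1, 1]

def scale(xs, mean, std):
    """Apply an already-fitted standardization to a list of values."""
    return [(x - mean) / std for x in xs]

def random_oversample_indices(labels, seed):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(range(len(labels))) + extra

# Scaling parameters fitted once, on the original (unresampled) data.
mean, std = statistics.fmean(values), statistics.pstdev(values)
idx = random_oversample_indices(labels, seed=0)

# Order 1: scale first, then resample.
scaled_then_sampled = [scale(values, mean, std)[i] for i in idx]
# Order 2: resample first, then apply the same fitted scaling.
sampled_then_scaled = scale([values[i] for i in idx], mean, std)

print(scaled_then_sampled == sampled_then_scaled)  # prints: True

# If the scaler is instead refit after resampling, the duplicated
# minority rows pull the mean upward, so the scaled values shift.
refit_mean = statistics.fmean([values[i] for i in idx])
print(refit_mean > mean)  # prints: True
```

The two orderings agree exactly here. The only difference arises if you refit the scaler on the resampled data, since the duplicated rows change the mean and standard deviation it estimates.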

"if you use a random over/under-sampler, I don't think it plays much of a difference." This would be a good simulation study for someone out there to do — Mark White, Aug 22 '18 at 17:15
