When working with imbalanced datasets, should one do one-hot encoding and data standardization before or after sampling techniques (such as oversampling or undersampling)?
Viewed 7,936 times
1 Answer
12
It doesn't make much of a difference, but you should do most pre-processing steps (encoding, normalization/standardization, etc.) before under/over-sampling the data.
This is because many sampling techniques require a simple model to be trained (e.g. SMOTE uses a k-NN algorithm to generate samples, and the ClusterCentroids under-sampling technique involves k-means clustering). These models perform better on pre-processed datasets (e.g. both k-NN and k-means use Euclidean distance, which is sensitive to feature scale, so the data should be normalized first).
So, for the sampling techniques to work best, you should perform any pre-processing steps you can beforehand. That being said, if you use a random over/under-sampler, I don't think it plays much of a difference.
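To illustrate the Euclidean-distance point, here is a minimal standard-library sketch (not SMOTE itself, just the nearest-neighbour lookup that distance-based samplers rely on). The four minority-class samples, the feature names, and the helper functions `standardize` and `nearest_neighbor` are made up for illustration. On the raw data the large-valued income column dominates the distance, so the "nearest" neighbour of a 25-year-old is a 60-year-old; after standardization both features contribute.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit variance (population std)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_neighbor(rows, i):
    """Index of the row closest to rows[i] by Euclidean distance."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))

# Hypothetical minority-class samples: (age in years, income in dollars).
minority = [(25, 50_000), (26, 58_000), (60, 50_500), (40, 90_000)]

# Raw features: income differences swamp age differences.
raw_nn = nearest_neighbor(minority, 0)

# Standardized features: age and income are on comparable scales.
scaled_nn = nearest_neighbor(standardize(minority), 0)

print("nearest to sample 0 (raw):   ", minority[raw_nn])
print("nearest to sample 0 (scaled):", minority[scaled_nn])
```

A sampler that interpolates between a point and its neighbours would therefore build synthetic points from different pairs depending on whether the data was scaled first, which is why the order matters for SMOTE-like methods but not for purely random resampling.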

Djib2011
3 "if you use a random over/under-sampler, I don't think it plays much of a difference." This would be a good simulation study for someone out there to do. – Mark White Aug 22 '18 at 17:15