9

When working with imbalanced datasets, should one do one-hot encoding and data standardization before or after sampling techniques (such as oversampling or undersampling)?

Mark White

1 Answer

12

It doesn't make much of a difference, but you should do most pre-processing steps (encoding, normalization/standardization, etc.) before under/over-sampling the data.

This is because many sampling techniques fit a simple model internally (e.g. SMOTE uses k-NN to generate samples, and the ClusterCentroids under-sampling technique uses k-means clustering). These models work better on pre-processed data (e.g. both k-NN and k-means use Euclidean distance, which is sensitive to feature scale, so the data should be normalized first).
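
To make the scale point concrete, here is a small standard-library sketch (not taken from any sampling library). The age/income rows and the helper names `standardize` and `nearest_to_first` are made up for illustration. It shows the nearest neighbour of the same point changing once the features are standardized, which is the kind of shift that would affect a k-NN-based sampler.

```python
import math
import statistics

# Hypothetical toy data: (age in years, income in dollars).
rows = [
    (25.0, 50_000.0),  # query point
    (60.0, 50_500.0),  # very different age, almost the same income
    (26.0, 52_000.0),  # almost the same age, slightly different income
    (40.0, 80_000.0),
    (55.0, 30_000.0),
]

def standardize(data):
    """Rescale every column to zero mean and unit (population) variance."""
    cols = list(zip(*data))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in data]

def nearest_to_first(data):
    """Index of the row with the smallest Euclidean distance to row 0."""
    query = data[0]
    return min(range(1, len(data)), key=lambda i: math.dist(query, data[i]))

# On raw values the income column dominates the distance.
raw_neighbor = nearest_to_first(rows)
# After standardization both columns contribute on a comparable scale.
scaled_neighbor = nearest_to_first(standardize(rows))

print("nearest neighbour on raw data:   row", raw_neighbor)
print("nearest neighbour on scaled data: row", scaled_neighbor)
```

On the raw values the 35-year age gap is swamped by dollar differences, so row 1 looks closest; after scaling, row 2 does.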

So, for the sampling techniques to work best, you should first perform any pre-processing steps you can. That being said, if you use a random over/under-sampler, I don't think it plays much of a difference.
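
As a rough illustration of that ordering, the sketch below standardizes first and only then generates synthetic minority samples by interpolating between near neighbours. This is a simplified, hand-rolled SMOTE-style routine using only the standard library, not the actual imbalanced-learn implementation. The data, the `smote_like` helper and its parameters are all hypothetical.

```python
import math
import random
import statistics

def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new points, each on the segment between a random minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical imbalanced data: (age, income), label 1 is the minority class.
X = [[25, 50_000], [47, 82_000], [52, 61_000], [33, 45_000],
     [58, 90_000], [29, 52_000], [61, 88_000], [41, 70_000]]
y = [0, 0, 0, 0, 0, 1, 1, 1]

# Step 1: pre-process, so that distances are computed on comparable scales.
X_scaled = standardize(X)

# Step 2: over-sample the minority class in the scaled space.
minority = [x for x, label in zip(X_scaled, y) if label == 1]
n_new = y.count(0) - y.count(1)
synthetic = smote_like(minority, n_new)

X_balanced = X_scaled + synthetic
y_balanced = y + [1] * n_new
print(len(X_balanced), y_balanced.count(0), y_balanced.count(1))
```

In practice you would probably use imbalanced-learn's `SMOTE` rather than a hand-written version. The point here is only the order of the two steps.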

Djib2011
  • 3
    "if you use a random over/under-sampler, I don't think it plays much of a difference." This would be a good simulation study for someone out there to do – Mark White Aug 22 '18 at 17:15