
If you create a synthetic dataset based on the training set, then your independent variable includes both the hyperparameters and the dataset itself: you are simultaneously searching for the optimal way of oversampling and the optimal hyperparameters. Those two factors are not independent, so you would have to fix the hyperparameters and vary the oversampling method, and then fix the oversampling method and vary the hyperparameters.
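The alternating scheme just described can be sketched abstractly. Here `score` is a hypothetical stand-in for one full cross-validation run at a given (oversampling, hyperparameter) setting; a real version would resample the training folds and fit a model:

```python
def score(oversampling, hyperparams):
    # Hypothetical stand-in for a full CV evaluation at this setting.
    # A real version would oversample the training folds and fit a model.
    return -(oversampling["ratio"] - 0.8) ** 2 - (hyperparams["C"] - 1.0) ** 2

oversampling_grid = [{"ratio": r} for r in (0.5, 0.8, 1.0)]
hyper_grid = [{"C": c} for c in (0.1, 1.0, 10.0)]

# Step 1: fix the hyperparameters, vary the oversampling method.
fixed_h = hyper_grid[0]
best_o = max(oversampling_grid, key=lambda o: score(o, fixed_h))

# Step 2: fix the chosen oversampling method, vary the hyperparameters.
best_h = max(hyper_grid, key=lambda h: score(best_o, h))

print(best_o, best_h)  # → {'ratio': 0.8} {'C': 1.0}
```

Note that this coordinate-wise search only finds the joint optimum here because the toy objective is separable; when the two factors interact, neither pass is guaranteed to land on the best combination.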

GridSearchCV already suffers from the curse of dimensionality. Once the grid includes every combination of oversampling parameters and model hyperparameters, you have far too many combinations and you are just brute-forcing for a solution.
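To see how quickly the combined grid grows, here is a small count. The parameter names and values are illustrative only (`k_neighbors` and `sampling_strategy` in the style of SMOTE, `C` and `penalty` in the style of a regularized classifier):

```python
# Illustrative grids; the names and values are assumptions, not a recipe.
smote_grid = {"k_neighbors": [3, 5, 7], "sampling_strategy": [0.5, 0.75, 1.0]}
model_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}

n_combos = 1
for values in (*smote_grid.values(), *model_grid.values()):
    n_combos *= len(values)

print(n_combos)  # → 72, i.e. 3 * 3 * 4 * 2 model fits per CV fold
```

With 5-fold CV that is already 360 fits, and each additional parameter multiplies the total.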

SMOTE means creating a collection of synthetic, oversampled datasets. In cross-validation, your independent variable should be the tuple of hyperparameters, and your dependent variable is your CV metric. The fixed variables include the data. SMOTE makes the data not fixed.
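The "data not fixed" point can be made concrete with a hand-rolled sketch of the SMOTE interpolation idea (this is a minimal illustration, not the reference implementation): synthetic minority points are placed on segments between a minority point and one of its nearest minority neighbours, so different random draws yield different datasets.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    # Minimal sketch of SMOTE-style interpolation: pick a minority point,
    # pick one of its k nearest minority neighbours, and place a synthetic
    # point somewhere on the segment between them.
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.random.default_rng(0).normal(size=(20, 2))  # toy minority class
a = smote_like(X_min, 10, rng=1)
b = smote_like(X_min, 10, rng=2)
print(a.shape)                # → (10, 2)
print(np.allclose(a, b))      # different seeds give different "training data"
```

Because the resampled dataset changes with the random draw, a CV score computed on it measures the model *and* the particular synthetic sample, not the model alone.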

Germania
    Yet another reason to prefer evaluation of the probability outputs with proper [tag:scoring-rules]! Also: https://stats.stackexchange.com/questions/357466 – Dave Dec 02 '21 at 03:21
  • Per the thread Dave linked to (I'm the author), I do not think it makes sense to think of an "*optimal* way of oversampling". It's kind of like looking for the optimal way of shooting oneself in the foot. – Stephan Kolassa Dec 02 '21 at 06:34

0 Answers