
I am running a decision tree, and to balance the class labels I used SMOTE. The dataset originally consisted of 350k records; after balancing it has 1,400k records. The resulting decision tree has 10 terminal nodes, so there are 10 decision rules, one per terminal node.

The problem arises when I apply these 10 rules to the 350k original records: one of the rules does not match any record in the original (imbalanced) dataset. In other words, the "problematic" decision tree rule was built entirely from synthetic records, so it applies to the balanced dataset (1,400k records) but not to the imbalanced one. That is why I am calling it a "synthetic" rule.
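To make the check concrete, here is a minimal sketch of how one can count how many original records each rule covers. The rules and feature here are made up for illustration (your real rules would come from the fitted tree's terminal-node conditions):

```python
# Hypothetical example: express each terminal-node rule as a predicate
# over a record, then count how many ORIGINAL (imbalanced) records
# satisfy each one. A rule with zero coverage is a "synthetic" rule.
rules = {
    "rule_1": lambda row: row["x"] <= 0.0,       # made-up condition
    "rule_2": lambda row: 0.0 < row["x"] < 1.0,  # matches only fractional x
    "rule_3": lambda row: row["x"] >= 1.0,       # made-up condition
}

# Original records, where feature "x" is binary (only 0 or 1 occur).
original = [{"x": 0.0}, {"x": 1.0}, {"x": 1.0}]

coverage = {name: sum(pred(r) for r in original)
            for name, pred in rules.items()}
print(coverage)  # rule_2 covers zero original records
```

Here `rule_2` only fires for fractional values of a binary feature, so its coverage on the original data is zero even though synthetic (interpolated) records can satisfy it.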

So, my question is: am I doing something wrong, or is it expected to get a "synthetic" rule?

Best Regards

mauron

1 Answer


If you have discrete variables (which is generally considered bad practice with SMOTE, I guess for exactly this reason), this can happen quite easily. Suppose you have a binary variable. Then some of the synthetic data points are likely to have fractional values for that variable, and a tree can separate those samples out with two splits.
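To illustrate why this happens, here is a minimal sketch of SMOTE's core interpolation step (x_new = x + λ·(neighbor − x), λ in (0, 1)); the function name and the toy samples are my own, not from any library:

```python
import random

def smote_point(x, neighbor, lam=None):
    """Create one synthetic sample the way SMOTE does: interpolate
    between a minority sample and one of its nearest neighbors."""
    if lam is None:
        lam = random.random()  # lam drawn uniformly from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

# Two real minority samples; the second feature is binary (0 or 1).
a = [2.0, 0.0]
b = [3.0, 1.0]

synthetic = smote_point(a, b, lam=0.5)
print(synthetic)  # [2.5, 0.5] -- the binary feature becomes fractional
```

No real record can have 0.5 in that binary column, so a tree split such as `0 < feature < 1` isolates a region populated only by synthetic points, which is exactly the kind of rule you observed.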

Categorical variables can be dealt with (Oversampling with categorical variables), but if you leave discrete numerical features in, you leave yourself open to this particular kind of synthetic rule. Presumably, similar but subtler "holes" in your dataset could also be filled by SMOTE and then picked out by your tree model.

Finally, please consider whether you actually need to balance your dataset.
What is the root cause of the class imbalance problem?
When is unbalanced data really a problem in Machine Learning?
Class imbalance in Supervised Machine Learning

Ben Reiniger