SMOTE in unbalanced dataset with binary features

Question

After reading several posts about unbalanced datasets, I am still unsure how they apply to my specific problem, so I'm posting a new question.

I have a dataset with around 20K rows and 40 features. I'm trying to do binary classification, but the minority class makes up only 7% of the instances. I have read about different sampling methods for dealing with this problem. Among those, I tried SMOTE using the "unbalanced" R package, but I doubt whether it handles my data properly. Of those 40 features, only one is numeric (age); all the others are binary (yes/no for given diseases). As far as I know, SMOTE is designed for continuous data, since it uses the Euclidean distance to find neighbors.

Does anyone know whether it is correct to apply this technique to my dataset with binary features? And if not, how could I handle this problem?

Thank you so much in advance.

[Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) — Stephan Kolassa, Oct 28 '19 at 13:39

score 1 · Answer 1 · answered Oct 28 '19 at 15:20

Unless the age feature is very important, SMOTE will not amount to much more than random oversampling with replacement in this case, assuming you are forcing the binary attributes to be exactly 0 or 1.

This is because the binary attributes of each synthetic example will necessarily be equal to those of one of the two original examples used in its creation (whichever one the random interpolation weight is closer to); only age is actually interpolated.
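A minimal sketch of this effect, using only the Python standard library rather than the R package from the question. It assumes one shared interpolation weight per synthetic row (an assumption about how the SMOTE implementation works) and rounding of the binary columns back to 0 or 1. The two parent rows and the function name `smote_like_sample` are made up for illustration.

```python
import random

def smote_like_sample(a, b, rng):
    """Interpolate between parents a and b with one shared random gap,
    then round every feature except the first (age) back to 0 or 1."""
    gap = rng.random()  # single weight shared by all features, in [0, 1)
    raw = [x + gap * (y - x) for x, y in zip(a, b)]
    # Index 0 is the numeric age; the remaining columns are binary flags.
    return [raw[0]] + [int(v >= 0.5) for v in raw[1:]]

rng = random.Random(0)
parent_a = [34.0, 1, 0, 0, 1, 1]  # hypothetical row: age, then five yes/no flags
parent_b = [61.0, 0, 0, 1, 1, 0]  # hypothetical nearest neighbour

synthetic = [smote_like_sample(parent_a, parent_b, rng) for _ in range(1000)]

# Barring an exact tie at gap == 0.5, the binary part of every synthetic
# row is a copy of the binary part of one of the two parents.
binary_matches_parent = all(
    s[1:] == parent_a[1:] or s[1:] == parent_b[1:] for s in synthetic
)
print(binary_matches_parent)
```

Only the age column takes values that neither parent has, which is why the result behaves much like duplicating existing minority rows.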

The proper solution to your problem depends on what the problem really is.

If your problem is relative class imbalance, i.e. you are worried that the classifier will give too much weight to either false positives or false negatives because of the relative weight of the classes in your dataset, then you can look into cost-sensitive learning (ideal if you can determine the costs of different types of mistakes) or random sampling methods. I'm sure there's a synthetic oversampling method out there designed for binary data as well, but I wouldn't count on it making a huge difference.

However, if what you are worried about is the dearth of minority class data, i.e. you believe that you don't have a representative sample of that class (for example, you might be having trouble classifying very rare cases when they only occur once in your dataset), then finding more data of that class is really the only option that works. See http://tjo018.inha.ac.kr/Achievements/Research/Journals/Journal2004_02.pdf for more details on this particular problem.
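For the relative-imbalance case, two common cost-sensitive devices can be sketched as follows. This is an illustration under stated assumptions, not a recipe from the answer: the costs (1 for a false positive, 10 for a false negative) are hypothetical, correct predictions are assumed to cost nothing, the threshold rule assumes the classifier outputs calibrated probabilities, and the class counts are derived from the roughly 20K rows and 7% minority share in the question.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting the positive class has the
    lower expected cost."""
    # Predict positive when p * cost_fn > (1 - p) * cost_fp,
    # which rearranges to p > cost_fp / (cost_fp + cost_fn).
    return cost_fp / (cost_fp + cost_fn)

def balanced_class_weights(n_negative, n_positive):
    """Weights inversely proportional to class frequency: n / (2 * n_class)."""
    n = n_negative + n_positive
    return {0: n / (2 * n_negative), 1: n / (2 * n_positive)}

# Hypothetical costs: a missed positive is ten times worse than a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=10.0)

# About 7% of 20,000 rows are positive.
weights = balanced_class_weights(n_negative=18600, n_positive=1400)

print(threshold)  # 1/11, roughly 0.09
print(weights)
```

With these weights each class contributes the same total weight to the training loss; with the threshold, the default cut-off of 0.5 is replaced by one that reflects the stated costs.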

Thank you so much for your reply @Vincent B. Lortie. In my case, I would say that I'm facing the relative class imbalance "problem": I'm worried that the classifier will give too much importance to the negative class because it makes up 93% of the available data. Although SMOTE was improving my results slightly, I wasn't sure whether it was doing the right thing. With your brief explanation I now realize that it is much like random oversampling with replacement. I will therefore look into cost-sensitive learning to see if I can get better results than the ones I have now. — Jose LHS, Oct 29 '19 at 14:24

score 0 · Answer 2 · answered Dec 26 '20 at 18:25

In addition to the cost-sensitive learning proposed by Vincent B., you can also undersample your data, or even build an ensemble of models such as:

model 1: undersampled majority class (random subset 1) + entire minority class

model 2: undersampled majority class (random subset 2) + entire minority class

. . .

model n: undersampled majority class (random subset n) + entire minority class

provided that there is some diversity among the undersampled majority subsets, and possibly among the models themselves. You can then combine their outputs into a final prediction, for example by majority vote.
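A rough sketch of the data side of such an ensemble, using only the Python standard library. The helper names `undersampled_subsets` and `majority_vote` and the toy rows are hypothetical, and fitting the individual models is left out; each subset would be passed to whatever classifier you use.

```python
import random
from collections import Counter

def undersampled_subsets(majority, minority, n_models, seed=0):
    """Yield one training set per model: a fresh random majority subset
    (same size as the minority class) plus the entire minority class."""
    rng = random.Random(seed)
    for _ in range(n_models):
        yield rng.sample(majority, len(minority)) + list(minority)

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical toy data: rows are (row_id, label), 93 negatives and 7 positives.
majority = [(i, 0) for i in range(93)]
minority = [(i, 1) for i in range(93, 100)]

subsets = list(undersampled_subsets(majority, minority, n_models=5))

# Combining the (made-up) predictions of five models for one new case.
final_label = majority_vote([1, 0, 1, 1, 0])
print(len(subsets), final_label)
```

Using an odd number of models avoids tied votes in the binary case. Because each model sees a different random slice of the majority class, the ensemble as a whole uses more of the majority data than a single undersampled model would.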
