
I have a binary classification problem and I'm working with an unbalanced dataset. The counts for each class look like:

Training set:
Class 0: 29 cases
Class 1: 6246 cases

Test set:
Class 0: 2678 cases
Class 1: 12 cases

I applied under-sampling, and the training set now contains:

Class 0: 29 cases
Class 1: 29 cases

After training a decision tree, these are the results:

Accuracy: 98.85%
Sensitivity: 0.00%
Specificity: 99.55%

The confusion matrix of the training set:

[[   7    5]
 [1446 1232]]

The confusion matrix of the test set:

[[  0   12]
 [ 19 2659]]

How should I fix this problem? The train_test_split test proportion is 0.3; should I decrease it?

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101, stratify=y)
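
For reference, the reported accuracy and sensitivity can be reproduced directly from the test-set confusion matrix above (treating the 12-case class as positive); this is a minimal sketch showing why a high accuracy can coexist with 0% sensitivity:

```python
# Entries taken from the test-set confusion matrix above
# (rows: actual class, columns: predicted class).
tp, fn = 0, 12      # the 12-case class: none predicted correctly
fp, tn = 19, 2659   # the majority class: almost all predicted correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # recall on the minority class

print(f"accuracy={accuracy:.2%} sensitivity={sensitivity:.2%}")
```

Because the majority class dominates the denominator, accuracy stays near 99% even though every minority-class case is misclassified.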
    If you split the train/test, why don't you preserve the class ratios? It seems like you create the extreme imbalance. – gunes Dec 01 '20 at 09:47
  • @gunes I'm working with sklearn; to split I used stratify=y, which I understood preserves the class ratios. Is there another way? – notarealgreal Dec 01 '20 at 09:57
  • My favorite tweet is by our Frank Harrell and is about SMOTE: https://twitter.com/f2harrell/status/1062424969366462473 – Dave Dec 01 '20 at 12:28
  • @Dave With random oversampling or SMOTE oversampling, the decision tree results are pretty much the same. Should I try Random Forest, Random Tree and some other ensemble algorithms? I was expecting at least a little improvement from training on the oversampled dataset, independently of the algorithm. – notarealgreal Dec 01 '20 at 12:48

1 Answer


Way 1: Give more weight to your minority class. Decision trees and random forests in sklearn both accept a class_weight parameter: pass class_weight = {class_label: class_weight}.
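
A minimal sketch of this, assuming scikit-learn (the toy dataset stands in for your real data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data (roughly 95% / 5%) standing in for the real dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Explicit per-class weights ({class_label: weight}): errors on class 1
# now cost 20x more when the tree evaluates candidate splits.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, random_state=0)
clf.fit(X, y)

# Or let sklearn derive the weights from the class frequencies:
clf_bal = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf_bal.fit(X, y)
```

class_weight="balanced" weights each class inversely proportional to its frequency, which avoids hand-tuning the dictionary.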

Way 2: Create synthetic data for your minority class. You can use SMOTE for this:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=2)
# Note: fit_sample was renamed fit_resample in imblearn 0.4
x_train_, y_train_ = sm.fit_resample(x_train, y_train)
  • I used the second option too, SMOTE oversampling implemented with **imblearn.over_sampling**, but the results are pretty much the same. – notarealgreal Dec 01 '20 at 11:19