I have a loan defaulters dataset and it is highly imbalanced as shown below:
    0     1
33108   673
I have tried SMOTE to balance the dataset, as shown below:
library(DMwR)   # SMOTE() comes from this package
smoted_data <- SMOTE(state ~ ., deliq, perc.over = 200, perc.under = 800)
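As a quick sanity check (a minimal sketch, assuming the DMwR implementation of SMOTE() and the deliq / state names used above), I compare the class balance before and after resampling:

# class balance before and after SMOTE
table(deliq$state)         # original, heavily imbalanced
table(smoted_data$state)   # resampled, should be much closer to balanced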
After applying SMOTE, I trained a logistic regression with glm(), as given below:
model1 <- glm(state ~ . - LanID - Month - LastMonthBnc - DELINQ.NON.DELINQ,
              data = smoted_data, family = "binomial", maxit = 500)
On the training data it was able to capture class "1" to a reasonable degree, although the error is still high: only about 23% of the cases predicted as 1 are actually 1 (533 out of 1774 + 533). Confusion matrix (actual classes in rows, predicted in columns):
        0     1
  0 31334  1774
  1   140   533
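For reference, a confusion matrix like the one above can be produced along these lines (a sketch, not my exact code: the 0.5 cutoff and the use of deliq as the evaluation frame are assumptions):

# predicted probabilities on the original training frame
train_prob <- predict(model1, newdata = deliq, type = "response")
# hard 0/1 labels at an assumed 0.5 cutoff
train_pred <- ifelse(train_prob > 0.5, 1, 0)
# confusion matrix: actual classes in rows, predicted classes in columns
table(deliq$state, train_pred)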
However, when I tested it on the test data, it was extremely poor: only about 4% of the cases predicted as 1 are actually 1 (7 out of 149 + 7):
   pred_class_30
        0     1
  0 10154   149
  1   210     7
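Because overall accuracy hides the minority-class behaviour, I read precision and recall for class 1 straight off the confusion matrix; a minimal sketch, assuming the test labels are in test$state (an assumed name) and that actuals are in rows and predictions in columns:

cm <- table(test$state, pred_class_30)    # test$state is an assumed column name
# precision on class 1: of the cases predicted as 1, how many are truly 1
precision_1 <- cm["1", "1"] / sum(cm[, "1"])
# recall on class 1: of the true 1s, how many the model catches
recall_1 <- cm["1", "1"] / sum(cm["1", ])
c(precision = precision_1, recall = recall_1)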
This indicates that my model is overfitted and I need it to generalize better.
My question is: 23% on the training data is still not good, so is there any other method that can help me improve accuracy on the minority class of such an imbalanced dataset?
I have checked the existing similar posts but could not find anything about how to improve accuracy on the minority class, especially when the model is overfitted.