
I am facing a multiclass classification problem with 5 categories: A has 107 instances, B has 101, C has 882, D has 229, and E has 129. I used KNN, random forest, and SVM, and the best accuracy score I got was 62%. So my question is: am I getting a low accuracy score because of imbalanced data (since C has 882 instances, far more than any other category), or is it something else? NB: I looked at the y_pred vector of predicted values and noticed that all of them are 2 (I encoded C as 2). Why is that?

1 Answer


This is happening because of the imbalanced dataset. In order to avoid overfitting, you can use a boosting algorithm over trees of depth 1 and do a grid search to find the best boosting parameters; in Python you can use AdaBoost. Another measure to take is to edit the loss function of the algorithms you tried so that it weights classes in inverse proportion to their frequency. E.g., if you have 80% class A and 20% class B, then make your loss function: $$L = \text{Misclassified}_A \cdot 0.2 + \text{Misclassified}_B \cdot 0.8$$ Of course, you will have to play with the numbers, but the idea is there.

BIM
  • Why do you suggest boosting depth one trees? – Matthew Drury Aug 19 '18 at 04:24
  • You can use other depths but the deeper trees are usually more prone to overfitting. – BIM Aug 19 '18 at 18:27
  • @BIM It is not possible to state that this is due to the class imbalance problem based on the information provided. It may be that 62% is as good as you can do on this dataset if the density of C is higher than that of any other class everywhere; see my question here https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance . It could also be that the hyper-parameters of the classifiers are not well tuned, or any number of other issues. – Dikran Marsupial Aug 10 '21 at 17:09