
I am facing a multiclass classification problem with 5 categories: A has 107 instances, B has 101, C has 882, D has 229, and E has 129. I used KNN, random forest, and SVM, and the best accuracy score I got was 62%. So my question is: am I getting a low accuracy score because of imbalanced data (since C has 882 instances, far more than any other category), or is it something else? NB: I looked at the y_pred vector of predicted values and noticed that all of them are 2 (I encoded C as 2). Why is that?

1 Answer


This is happening because of the imbalanced dataset. In order to avoid overfitting, you can use a boosting algorithm over trees of depth 1 and do a grid search to find the best boosting parameters; in Python you can use AdaBoost. Another measure to take is to edit the loss function of the algorithms you tried so that it weights classes in inverse proportion to their frequency. E.g., if you have 80% class A and 20% class B, then make your loss function: $$L = \text{Misclassified}_A \cdot 0.2 + \text{Misclassified}_B \cdot 0.8$$ Of course, you will have to play with the numbers, but the idea is there.

BIM
  • Why do you suggest boosting depth one trees? – Matthew Drury Aug 19 '18 at 04:24
  • You can use other depths but the deeper trees are usually more prone to overfitting. – BIM Aug 19 '18 at 18:27
  • @BIM It is not possible to state that this is due to the class imbalance problem based on the information provided. It may be that 62% is as good as you can do on this dataset if the density of C is higher than that of any other class everywhere; see my question here https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance . It could also be that the hyper-parameters of the classifiers are not well tuned, or any number of other issues. – Dikran Marsupial Aug 10 '21 at 17:09