
For an NLP classification task I need to train two different classifiers, and I've chosen RandomForest and KNeighbors, both in their scikit-learn implementations.

My dataset is strongly imbalanced. Below are the counts of documents for the different subjects that are the targets of the classification task.

[image: document counts per subject class]

I have created stratified train and test samples, and with RandomForest I can set the `class_weight` parameter to `"balanced_subsample"` to ensure that the majority classes are penalised and the minority classes are boosted.

With RandomForest, after tuning hyperparameters, I'm able to achieve an F1-score of 0.59, an accuracy of 0.58, and a ROC AUC score of 0.82.
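For reference, a minimal sketch of this setup (`X` and `y` are placeholders for my document features and subject labels; the macro averaging for F1 and AUC is an assumption on my part):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stratified split so every class keeps its proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced_subsample" reweights classes inversely to their
# frequency within each bootstrap sample, so minority classes count more.
rf = RandomForestClassifier(class_weight="balanced_subsample", random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
print("ROC AUC :", roc_auc_score(y_test, rf.predict_proba(X_test),
                                 multi_class="ovr", average="macro"))
```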

The KNeighbors classifier does equally well, but I feel it should perform worse, and I suspect it is reaching its current accuracy simply by predicting the majority classes correctly.
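This suspicion is easy to check from the per-class metrics; a sketch, reusing the split from above:

```python
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

# Fit a plain KNN and inspect per-class precision/recall: if recall on
# the minority classes is near zero while the majority classes look
# fine, the overall accuracy really is coming from the majority classes.
knn = KNeighborsClassifier().fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```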

So my questions are: does the KNN classifier need class weights for imbalanced classification? And if so, how should I add them?
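As far as I can tell, `KNeighborsClassifier` has no `class_weight` parameter like `RandomForestClassifier`; it only exposes a `weights` parameter that controls how the neighbours vote, which is not the same thing. A sketch of the two built-in options (the default `n_neighbors=5` is assumed), again reusing the split from above:

```python
from sklearn.neighbors import KNeighborsClassifier

# "weights" controls how the k nearest neighbours vote:
# "uniform"  - every neighbour counts equally (the default)
# "distance" - closer neighbours count more
# Neither option reweights by class frequency.
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)
    print(weights, "accuracy:", knn.score(X_test, y_test))
```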

EDIT: I've seen in the abstract here: [link] and in the answer here: [link] that the KNN classifier normally doesn't have any issues with imbalanced data, but I want to confirm that this understanding is correct.

– pavel
    If your classifier is classifying everything as belonging to the majority class, that may just be the Bayes optimal decision rule, see my question here https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance - nobody seems to have a method of diagnosing whether class imbalance actually is a problem (I put a small bounty on it, but no solutions to the question were posted). – Dikran Marsupial Nov 18 '21 at 19:32

0 Answers