
For an NLP classification task I need to train two different classifiers, and I've chosen RandomForest and KNeighbors, both in their scikit-learn implementations.

My dataset is strongly imbalanced. Below are the counts of documents for the different subjects that are the targets of the classification task.

[image: document counts per subject class]

I have created stratified train and test samples, and with RandomForest I can set the `class_weight` parameter to `"balanced_subsample"` to ensure that the majority classes are penalised and the minority classes are boosted.

With RandomForest, after tuning hyperparameters, I'm able to achieve an F1-score of 0.59, an accuracy of 0.58, and a ROC AUC score of 0.82.
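For reference, a minimal sketch of this setup (`X` and `y` are placeholders for my document features and subject labels; the macro averaging for F1 and AUC is an assumption on my part):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stratified split so every class keeps its proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced_subsample" reweights classes inversely to their
# frequency within each bootstrap sample, so minority classes count more.
rf = RandomForestClassifier(class_weight="balanced_subsample", random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
print("ROC AUC :", roc_auc_score(y_test, rf.predict_proba(X_test),
                                 multi_class="ovr", average="macro"))
```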

The KNeighbors classifier does equally well, but I feel it should perform worse, and I suspect it is reaching its current accuracy simply by predicting the majority classes correctly.
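This suspicion is easy to check from the per-class metrics; a sketch, reusing the split from above:

```python
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

# Fit a plain KNN and inspect per-class precision/recall: if recall on
# the minority classes is near zero while the majority classes look
# fine, the overall accuracy really is coming from the majority classes.
knn = KNeighborsClassifier().fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```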

So my questions are: does the KNN classifier need class weights for imbalanced classification? And if so, how should I add them?
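As far as I can tell, `KNeighborsClassifier` has no `class_weight` parameter like `RandomForestClassifier`; it only exposes a `weights` parameter that controls how the neighbours vote, which is not the same thing. A sketch of the two built-in options (the default `n_neighbors=5` is assumed), again reusing the split from above:

```python
from sklearn.neighbors import KNeighborsClassifier

# "weights" controls how the k nearest neighbours vote:
# "uniform"  - every neighbour counts equally (the default)
# "distance" - closer neighbours count more
# Neither option reweights by class frequency.
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)
    print(weights, "accuracy:", knn.score(X_test, y_test))
```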

EDIT: I've seen in the abstract here: [link] and in the answer here: [link] that the KNN classifier normally doesn't have any issues with imbalanced data, but I want to confirm that this understanding is correct.

– pavel
    If your classifier is classifying everything as belonging to the majority class, that may just be the Bayes optimal decision rule, see my question here https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance - nobody seems to have a method of diagnosing whether class imbalance actually is a problem (I put a small bounty on it, but no solutions to the question were posted). – Dikran Marsupial Nov 18 '21 at 19:32

0 Answers