
Do you think that unbalanced classes are a big problem for k-nearest neighbor? If so, do you know any smart way to handle this?

Peter Smit

4 Answers


I believe Peter Smit's response below confuses K-nearest neighbor (KNN) and K-means, which are very different.

KNN is susceptible to class imbalance, as described well here: https://www.quora.com/Why-does-knn-get-effected-by-the-class-imbalance
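
A minimal sketch of the effect (assuming scikit-learn; the data and class sizes here are made up for illustration): once k exceeds twice the size of the minority class, the minority can never win a plain majority vote, even at the center of its own cluster.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# 200 majority points around (0, 0), but only 10 minority points around (3, 3)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 200 + [1] * 10)

clf = KNeighborsClassifier(n_neighbors=25).fit(X, y)

# Query at the minority cluster's own center: with k = 25 and only 10
# minority points in the whole training set, at least 15 of the k
# neighbors must be majority points, so class 1 can never win the vote.
print(clf.predict([[3.0, 3.0]]))  # -> [0]
```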

TracerfireMD

Imbalanced class sizes are both a theoretical and a practical problem with KNN, one that has been characterized in the machine learning literature since at least 2003. This is particularly vexing when some classes have a low occurrence in your primary dataset (e.g., fraud detection, disease screening, spam filtering).

A Google Scholar search shows several papers describing the issue and strategies for mitigating it by customizing the KNN algorithm:

  • weighting neighbors by the inverse of their class size, which converts neighbor counts into the fraction of each class that falls among your K nearest neighbors (sketched below)
  • weighting neighbors by their distances
  • using a radius-based rule for gathering neighbors instead of the K nearest ones (often implemented in KNN packages)

I've also found these two blogs helpful for general background on imbalanced class sizes.
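
Here is a minimal sketch of the first strategy in the list above, assuming scikit-learn; the function name and interface are my own. Each neighbor's vote is divided by its class size, so the vote totals become the fraction of each class present among the K nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def class_weighted_knn_predict(X_train, y_train, X_test, k=5):
    """Vote among the k nearest neighbors, weighting each neighbor's
    vote by the inverse of its class size, so that raw neighbor counts
    become per-class fractions."""
    classes, counts = np.unique(y_train, return_counts=True)
    inv_size = dict(zip(classes, 1.0 / counts))

    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)

    preds = []
    for neighbor_ids in idx:
        votes = {c: 0.0 for c in classes}
        for j in neighbor_ids:
            votes[y_train[j]] += inv_size[y_train[j]]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)
```

With equal class sizes this reduces to ordinary majority voting; with imbalanced classes it stops a handful of majority-class neighbors from outvoting a dense pocket of a rare class.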

  • Does this mean, with Sklearn's KNeighborsClassifier, that using the parameter weights = 'distance' can help in the case of imbalanced data? Do you know how to pass the matthews_corrcoef metric to a Sklearn KNeighborsClassifier classifier? – Claude COULOMBE Oct 27 '19 at 21:29
  • It can help, but success will vary with the particular problem. On problems I've worked on, 1/d weighting is usually insufficient for the results I need. – JosiahJohnston Jul 22 '20 at 21:09
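
On the first comment's question: matthews_corrcoef is an evaluation metric, not a distance metric, so it cannot be passed as KNeighborsClassifier's metric parameter. One way to use it, sketched here under the assumption that scikit-learn is available (the synthetic data is made up for illustration), is as the scorer during model selection, where weights='distance' can be compared against uniform voting at the same time:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# A synthetic 90/10 imbalanced problem, just for illustration
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# MCC goes in as the model-selection scorer, not as KNN's distance metric
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 15, 25], "weights": ["uniform", "distance"]},
    scoring=make_scorer(matthews_corrcoef),
    cv=5,
).fit(X, y)

print(search.best_params_, search.best_score_)
```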

I would like to add one remark: KNN is sensitive to, let's say, the ratio of observations on the boundary of a given class to the total number of observations in that class. If you have three classes with the same number of observations drawn from the same distribution but with different means, and the second class visibly sits as a cloud between the other two (its expected value lies between theirs), then there will be more misclassifications in class number two. But something like this holds for every classifier.
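
A small sketch of that remark, assuming scikit-learn (the distributions and k are made up for illustration): three equally sized Gaussian classes in 1D, where class 1 sits between the other two and therefore has two decision boundaries instead of one.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 300
# Three equally sized 1D classes with means -3, 0, and 3
X = np.concatenate([rng.normal(-3, 1, n),
                    rng.normal(0, 1, n),
                    rng.normal(3, 1, n)]).reshape(-1, 1)
y = np.repeat([0, 1, 2], n)

pred = cross_val_predict(KNeighborsClassifier(n_neighbors=15), X, y, cv=5)
for c in range(3):
    print(c, round(float((pred[y == c] != c).mean()), 3))
# Class 1 borders both neighbors, so it collects roughly twice the
# misclassifications of the outer classes.
```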

Qbik

In principle, unbalanced classes are not a problem at all for the k-nearest neighbor algorithm.

Because the algorithm is not influenced in any way by the size of a class, it will not favor any class on the basis of size. Try running k-means with an obvious outlier and k+1 clusters, and you will see that most of the time the outlier gets its own class.

Of course, with hard datasets it is always advisable to run the algorithm multiple times, to avoid trouble caused by a bad initialization.

Peter Smit
  • I agree, but I was indeed worried about the k parameter: if the imbalance creates class-wise differences in observation density in the feature space, the same k corresponds to a smaller sphere in feature space for an observation from a denser class. Won't that influence the optimization of the k parameter? –  Jul 21 '10 at 08:45
  • Could you clarify whether you are writing about KNN or K-means, both of which you mention explicitly? – whuber Mar 03 '19 at 21:45
  • @peter-smit is it possible that you are confusing KNN and K-means here? – Renaud May 28 '19 at 15:47