
I read a lot of papers that test k-means on datasets that are not normally distributed, such as the iris dataset, and get good results. Since I understand that k-means is meant for normally distributed data, why is k-means being used on non-normally distributed data?

For example, the paper below modified the centroids from k-means based on a normal distribution curve, and tested the algorithm on the iris dataset, which is not normally distributed.

nearly all inliers (precisely 99.73%) will have point-to-centroid distances within 3 standard deviations ($\sigma$) from the population mean.

Is there something that I'm not understanding here?

user
  • What if this is simply quite a bad paper? It does not sound like a high-class venue to me. – Has QUIT--Anony-Mousse Sep 02 '19 at 05:51
  • The claim you quote from the paper is preceded by the assumption that the data is normal. What's needlessly restrictive in that paper is the claim that k-means assumes normality, suggesting that it couldn't be a satisfactory clustering procedure if the data isn't jointly normal. – CloseToC Sep 02 '19 at 07:58
  • The paper is published by IEEE. – user Sep 02 '19 at 14:11
  • My question relates to the experiment on the iris dataset in the same paper, as I noticed that the iris dataset is not normally distributed: https://www.kaggle.com/saurabh00007/iriscsv – user Sep 02 '19 at 14:14
  • Well, did you check what % of inliers in the `iris` dataset actually lie within 3 s.d. of the centroids? It likely still happens to be true; it just doesn't *automatically* follow if the distribution isn't normal. Presumably the authors just need to add a one-liner clarifying that. – smci Sep 02 '19 at 21:10
  • "To construct the contaminated sets, the magnitudes of errors added are informed by the range of values in each dataset. For Iris, the errors added to each feature are integer values sampled from the set {: 3 ≤ ≤ 5}" page 5 in the paper. – user Sep 03 '19 at 16:10

2 Answers


Here is the full quote:

K-means, being an instance of the Gaussian Mixture Model (GMM), assumes Gaussian data distribution [20][26]. It then follows that nearly all inliers (precisely 99.73%) will have point-to-centroid distances within 3 standard deviations ($\sigma$) from the population mean.

It appears in section IV.A.

The application to the Iris dataset, which, as you note, is not normally distributed, appears in section V ("Experiments").

I do not see a logical problem with first noting an algorithm's properties under certain assumptions, such as normality, and then testing it in cases where the assumption is not valid.

And of course, k-means can be applied to any dataset. Whether it yields useful results is a different matter.
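As smci's comment above suggests, this is straightforward to check empirically. Below is a minimal sketch (my own, not code from the paper) that runs scikit-learn's k-means on iris and measures what fraction of point-to-centroid distances fall within 3 standard deviations of their mean:

```python
# Sketch: k-means on iris, then check what share of point-to-centroid
# distances lie within 3 standard deviations of the mean distance.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to its assigned centroid.
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

inliers = np.abs(d - d.mean()) <= 3 * d.std()
print(f"{inliers.mean():.1%} of points lie within 3 sd of the mean distance")
```

Even without normality, this fraction may well come out close to 100%; the point is only that the paper's 99.73% figure does not follow automatically once the Gaussian assumption is dropped.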

Stephan Kolassa
  • Thank you. Can the assumption that points lie within a few standard deviations of the mean be acceptable for a non-normally distributed dataset? – user Sep 02 '19 at 14:32
  • It depends on the distribution you assume. – Stephan Kolassa Sep 02 '19 at 14:34
  • Can you explain more? If I have a right-skewed dataset, can I add outlier values greater than the mean + 4 standard deviations and follow the same assumption as the paper to detect them? – user Sep 02 '19 at 14:38
  • If you start with a distributional assumption, you cannot just "add outliers". The probability of "outliers" depends on the distribution you are assuming. (What an "outlier" is is also often questionable.) If something about the paper is unclear, it would probably be better to formulate a new question. – Stephan Kolassa Sep 02 '19 at 14:53
  • Errors are added to each feature in this paper (for 5% to 20% of the dataset). If something about the paper is unclear, I will ask another question. Thanks a lot. – user Sep 02 '19 at 15:20

I'm not sure what the question is exactly, but the standard deviation isn't defined only for normal distributions; it is a meaningful measure for any distribution (with finite variance). The farther away you are from the mean (in terms of standard deviations), the more unlikely a point is to occur. The only thing special about the normal distribution with regard to the standard deviation is that you know the probability of a point occurring within 1, 2, or 3 standard deviations (e.g., you know that a point has a probability of 99.7% of lying within $\pm 3$ standard deviations of the mean).

This, however, doesn't mean that the standard deviation is irrelevant for other (possibly unknown) distributions. It is still relevant, but you don't know the probability associated with it.
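To make that concrete, here is a small numerical sketch (mine, not part of the original answer). Chebyshev's inequality guarantees that any distribution with finite variance places at least $1 - 1/3^2 \approx 88.9\%$ of its mass within 3 standard deviations of the mean; the exact share depends on the distribution:

```python
# Share of samples within 3 standard deviations of the mean,
# for three quite different distributions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
samples = {
    "normal": rng.normal(size=n),            # ~99.7%
    "exponential": rng.exponential(size=n),  # right-skewed, ~98.2%
    "uniform": rng.uniform(size=n),          # 100%: support spans only ~3.5 sd
}
for name, x in samples.items():
    frac = np.mean(np.abs(x - x.mean()) <= 3 * x.std())
    print(f"{name:>11}: {frac:.1%} within 3 sd (Chebyshev floor: 88.9%)")
```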

CaucM
  • OK, that's what I mean, but in this paper the dataset is not normally distributed, and they still assume 99.7% of the data lies within ±3 standard deviations of the mean. My question relates to this point. – user Sep 01 '19 at 23:52
  • I think you're right. This assumption is false, in my opinion. – CaucM Sep 02 '19 at 00:02
  • `The farther away you are from the mean (in terms of std) the more unlikely this point is to occur.` This might not be true for multimodal distributions. – JAD Sep 02 '19 at 11:54
  • You know how likely it is for an event to occur within 1, 2, or 3 standard deviations for other distributions as well, so that isn't really special. One special thing is that, for a given mean and variance, the normal distribution is the one with the most entropy, so if you only know the mean and variance, you'd pick it by the principle of maximum entropy: https://en.wikipedia.org/wiki/Principle_of_maximum_entropy – etarion Sep 02 '19 at 15:21
  • Can this rule be made to work for other distributions? – user Sep 03 '19 at 16:14