
In short: I am using k-means clustering with correlation distance. How do I check how many clusters should be used, if any?

There are many indices and answers on how to establish the number of clusters when grouping data: example 1, example 2, etc. For now, I am using Dunn's index, but it is not sufficient for one of the reasons described below.
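For reference, a minimal sketch of the classic form of Dunn's index (assuming a precomputed pairwise distance matrix; the function name is illustrative). Note that it is undefined for a single cluster, which is exactly the second limitation listed below:

```python
import numpy as np

def dunn_index(dist, labels):
    """Classic Dunn's index: smallest inter-cluster distance divided by
    largest cluster diameter. Requires at least two nonempty clusters."""
    clusters = np.unique(labels)
    # Largest within-cluster pairwise distance (cluster diameter).
    max_diameter = max(dist[np.ix_(labels == c, labels == c)].max()
                       for c in clusters)
    # Smallest pairwise distance between points of different clusters.
    min_separation = min(dist[np.ix_(labels == a, labels == b)].min()
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:])
    return min_separation / max_diameter
```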

All those approaches exhibit at least one of the following problems, which I have to avoid:

Indices:

  • the derivation of a clustering quality index makes assumptions about the data covariance matrix; from that point on, only Euclidean or Euclidean-like metrics apply, and the correlation metric is no longer an option
  • it requires at least two nonempty clusters to compare already-calculated partitions; there is no way to state whether there is any reason to divide the data into groups at all

Clustering approaches:

  • clustering approaches that estimate the number of clusters themselves (e.g. affinity propagation) are much slower and do not scale well

To sum up: is there any criterion or index that allows checking for the existence of groups in the data (and perhaps estimating their number), without restricting the metric used?

EDIT: The space I am operating on has up to a few thousand features.
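For concreteness, a minimal sketch of such a setup (assuming scikit-learn; names are illustrative). Each observation is normalized to mean 0 and norm 1, after which ordinary squared-Euclidean k-means minimizes correlation distance:

```python
import numpy as np
from sklearn.cluster import KMeans

def correlation_kmeans(X, n_clusters, random_state=0):
    """k-means under correlation distance via row normalization.

    With each row centered to mean 0 and scaled to norm 1, squared
    Euclidean distance equals 2 * (1 - Pearson correlation), so plain
    k-means on the transformed rows minimizes correlation distance.
    """
    Xc = X - X.mean(axis=1, keepdims=True)               # mean 0 per row
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)  # norm 1 per row
    return KMeans(n_clusters=n_clusters,
                  random_state=random_state).fit(Xn).labels_
```

(Strictly speaking, spherical k-means also renormalizes centroids at each iteration, as the comments below note; plain `KMeans` is an approximation of that.)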

gmrukwa

  • You may want to look at [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) instead of k-means, which automatically determines the number of clusters - potentially just one. – Stephan Kolassa May 24 '17 at 08:09
  • @StephanKolassa: in high dimensionality DBSCAN degrades to O(n²), which scales poorly. I am operating on a space with up to a few thousand features. – gmrukwa May 24 '17 at 12:24
  • There isn't anything below O(n²). K-means has a worst case even worse than O(n²), if I recall correctly. And if I'm not mistaken, Dunn's index is O(n²) too (Silhouette definitely is). – Has QUIT--Anony-Mousse May 30 '17 at 07:39
  • @Anony-Mousse, it depends on how it's calculated. Dunn's index can be computed in different ways, depending on the definitions of inter- and intra-cluster distance; there are something like 18 variations. I use those with O(n) complexity - they work pretty well, but none will ever cover the 1-cluster case. For k-means itself, Lloyd's algorithm reaches O(nkdi), where n is the number of observations, k the number of clusters, d the dimensionality, and i the number of iterations. – gmrukwa May 30 '17 at 07:47
  • 1
    k-means: the problem is that the number of iterations, i, is exponential in n, see https://en.wikipedia.org/wiki/K-means_clustering#Complexity look for "superpolynomial". So worse than O(n³). P.S. k-means should *only* be used with Bregman divergences, such as squared Euclidean. – Has QUIT--Anony-Mousse May 30 '17 at 13:09
  • https://stackoverflow.com/questions/21323784/absolute-pearson-correlation-distance-in-kmeans-matlab – Has QUIT--Anony-Mousse May 30 '17 at 13:12
  • @Anony-Mousse, thanks for your explanation. In my application it suffices to cap the maximum number of iterations; that's why I did not tie it to n above. Isn't the problem with correlation distance resolved by a proper data transformation, i.e. normalizing all observations to have mean 0 and norm 1? Also, the data is such that no negative correlation is found (the case you showed - with zero mean - does not occur). – gmrukwa May 30 '17 at 13:43
  • Then Pearson correlation is cosine, and spherical k-means should be okay to use, yes (see the numerical check after these comments). I don't know if the results are actually meaningful. What least-squares objective do you minimize, and does it optimize your problem? – Has QUIT--Anony-Mousse May 30 '17 at 15:26
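A quick numerical check of the equivalence discussed in the last two comments (a sketch on synthetic data): after centering each row to mean 0 and scaling to norm 1, Pearson correlation coincides with cosine similarity, and squared Euclidean distance equals 2 * (1 - r), so Euclidean-only indices become applicable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))

# Normalize rows: mean 0, norm 1.
Xc = X - X.mean(axis=1, keepdims=True)
Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

corr = np.corrcoef(Xn)   # Pearson correlation between rows
cos = Xn @ Xn.T          # cosine similarity of normalized rows
sq_euc = ((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(axis=-1)

print(np.allclose(corr, cos))               # True: correlation == cosine
print(np.allclose(sq_euc, 2 * (1 - corr)))  # True: d² == 2(1 - r)
```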
