
I have a data set of 54,000 genes, and I clustered it with several methods: HAC, K-means, model-based clustering, and CLARA. The objective is to compare these methods, and I used the Adjusted Rand Index (ARI) to do so. But there is something that I do not understand.

With my data set, the ARI between two clustering results that were both obtained by K-means with the same number of clusters (i.e., I ran K-means twice) is only 0.40, which is not a high value.

My question is: if the ARI is not high when a method is compared with itself, can we use ARI to compare the clustering results of different methods? And are there other indices or methods to compare them? I have already read the topic "How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?" but I still do not understand which methods are used to compare clustering results.

ttnphns
kaitokid
  • In which respect did your two k-means analyses of the same dataset differ? – ttnphns Jul 10 '19 at 02:33
  • Are your data quantitative? K-means is appropriate mostly for quantitative data. – ttnphns Jul 10 '19 at 02:37
  • Thank you for your answer. My data are quantitative, and in R I used the command `kmeans(data, centers = 9)` to obtain the two k-means analyses. I just ran it at two different times. I also ran it many times, and the ARI varies between 0.11 and 1. Can we use ARI for a method that is very sensitive to the initial choice of centers, like K-means? – kaitokid Jul 10 '19 at 04:44
  • "At different times": what do you mean? Do you mean you took the same data once and then again? Were the initial centers the same? Well, even if they weren't, we should expect (with quantitative continuous data) quite similar results anyway. Perhaps you ought to tell more about your cluster analysis. – ttnphns Jul 10 '19 at 07:45
  • I took the same data, and in each command I specified only the number of clusters, which is always 9; the initial centers are chosen randomly. – kaitokid Jul 10 '19 at 08:26
  • To be more specific, the command I used in R is as follows: `adjustedRandIndex(kmeans(data, centers = 9, iter.max = 100)$cluster, kmeans(data, centers = 9, iter.max = 100)$cluster)`, where `data` is always the genomic data that I need to cluster. – kaitokid Jul 10 '19 at 08:34
  • What is the dimensionality of your data (how many variables)? – ttnphns Jul 10 '19 at 08:51
  • My data is a matrix of dimension 54000 x 18 – kaitokid Jul 10 '19 at 08:59
  • Are you sure there is any cluster structure in your data? And is your number of clusters correct? Did you first check the results with some internal cluster validity indices (such as Calinski-Harabasz)? Also see whether there are redundant dimensions (do PCA; maybe it is better to cluster on a few PCs than on 18 dimensions?). Until all of that is checked, there is not much sense in doing comparisons by external validity indices such as Rand or Adjusted Rand. – ttnphns Jul 10 '19 at 10:40
  • Thank you for your answer. My data are RNA-sequencing data obtained from different experiments, and I am sure that they have cluster structure. I've just calculated the CH index for a clustering result obtained by K-means; it returns a value of 21878.45, but I'm not really sure that this value says anything about my clusters. – kaitokid Jul 10 '19 at 11:26
  • Search for Calinski-Harabasz on this site. – ttnphns Jul 10 '19 at 11:39

2 Answers

  1. It is a fallacy to reason that "if the ARI value is not high for the same method compared with itself, can we use ARI to compare the clustering results for different methods". Cluster analysis results, for most methods including K-means, depend heavily on the input "tuning" parameters (for K-means these are the initial center seeds) and on data preprocessing. Your two runs of K-means, whose results you are comparing, differed, I suppose, in some of these respects (which ones, by the way? You haven't said). Why do you expect that the results must be very similar? They need not be, especially if there is hardly any cluster structure in the data or the number of clusters was wrong. There is no reason to think, a priori and in general, that the difference in results between two runs of the same method under different parameters ought to be smaller than the difference between two different methods.

  2. ARI's baseline (value 0) is not the absence of matching (similarity in results) but the level of chance matching. So a value of $0.40$ is not low; it is a medium-sized value, I would say. But what is the unadjusted Rand value? Did you check? It will be higher.

  3. There are many "external clustering criteria" besides Rand or Adjusted Rand. See some of their formulae in the description of my !cluagree SPSS macro on my web page (download the collection named "Compare partitions" there).
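To illustrate point 2, here is a minimal sketch in plain Python (standard library only, not the R `adjustedRandIndex` call from the question) that computes both the Rand index and the ARI from pair counts of the contingency table. The function names `pair_counts`, `rand_index` and `adjusted_rand_index` and the two tiny label vectors are made up for illustration; they are not the asker's data.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Return the pair counts that both indices are built from."""
    # Pairs of objects placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in partition A, and in partition B.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total = comb(len(labels_a), 2)  # all pairs of objects
    return together_both, together_a, together_b, total


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index: share of pairs on which the partitions agree."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    apart_both = total - in_a - in_b + both  # pairs separated in both partitions
    return (both + apart_both) / total


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: 0 at the chance level, 1 for identical partitions."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    expected = in_a * in_b / total   # agreement expected by chance
    maximum = (in_a + in_b) / 2      # largest attainable value of `both`
    if maximum == expected:          # degenerate case, e.g. one cluster in each
        return 1.0
    return (both - expected) / (maximum - expected)


# Hypothetical toy partitions of four objects.
a = [0, 0, 1, 1]
b = [0, 0, 1, 2]
print(rand_index(a, b))           # 5/6
print(adjusted_rand_index(a, b))  # 4/7
```

On this toy example the unadjusted Rand index (5/6) is clearly higher than the ARI (4/7), because the ARI subtracts the agreement that would be expected by chance.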

ttnphns

K-means is randomized.

Running it twice may give quite different clusterings. In particular, when it does not work well, it tends to produce very different results. On the few data sets where k-means works well, it usually produces similar results. It is perfectly in line with theory if k-means results are not similar to each other.

As mentioned in the other answer, 40% more than random is not too bad (this is the ARI, not the Rand index; you may also want to report the Rand index itself). It probably means that some of the k-means clusters agree while others don't, which is to be expected.
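The dependence on the starting centers can be seen on a toy example. The sketch below is plain Python (standard library only, not R's `kmeans`), and it uses two explicitly chosen sets of starting centers in place of random seeds so that the outcome is reproducible. The four points, the helper names `lloyd` and `ari`, and the starting centers are all hypothetical and only serve the illustration.

```python
from collections import Counter
from math import comb


def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given starting centers.

    Returns the final labels and the within-cluster sum of squares.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])))
            for pt in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(sum((p - c) ** 2 for p, c in zip(pt, centers[lab]))
              for pt, lab in zip(points, labels))
    return labels, wss


def ari(labels_a, labels_b):
    """Adjusted Rand index computed from pair counts."""
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(len(labels_a), 2)
    maximum = (in_a + in_b) / 2
    return 1.0 if maximum == expected else (both - expected) / (maximum - expected)


# Hypothetical toy data: the four corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two different starts, both with k = 2.
labels_lr, wss_lr = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])  # left/right split
labels_tb, wss_tb = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])  # top/bottom split

print(labels_lr, wss_lr)
print(labels_tb, wss_tb)
print(ari(labels_lr, labels_tb))
```

Both runs converge, but to different partitions: one splits left/right with a small within-cluster sum of squares, the other is stuck in a top/bottom split with a much larger one, and the ARI between them is negative. Keeping the run with the smallest within-cluster sum of squares out of several starts is the usual remedy; in R, the `nstart` argument of `kmeans()` does this, and `set.seed()` makes a run reproducible.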

Has QUIT--Anony-Mousse