I am performing k-means clustering on a demographic data set. I have taken $k = 3$, and each time I run the clustering process in software, I get a different set of clusters. I am not sure which result should be considered final. I understand why it produces different clusters each time, but how do I figure out which cluster is the most appropriate one? Is this a subjective choice?
-
If your question also depends on the software, why don't you say which software you are using in your post? – Ferdi Oct 26 '16 at 07:16
-
No, my question doesn't depend on the software. You can perform clustering (for a fixed k) any number of times in any software (MATLAB, Mathematica) and you will get different results each time. – Dark_Knight Oct 26 '16 at 07:27
-
The k-means solution depends on the initial selection of the centroids. Agree with @Ferdi. – L.V.Rao Oct 26 '16 at 07:28
-
@L.V.Rao Exactly! That's why it gives different clusters every time I do the clustering. I suppose this is common to all statistical tools that perform k-means clustering. – Dark_Knight Oct 26 '16 at 07:30
-
Why k=3? Have you tried other solutions? – L.V.Rao Oct 26 '16 at 07:30
-
@Dark_Knight In this case the k-means++ algorithm might be helpful. – Ferdi Oct 26 '16 at 07:30
-
Furthermore, k-means does not work properly for overlapping clusters. Maybe you have overlapping clusters. – Ferdi Oct 26 '16 at 07:32
-
@L.V.Rao Yes, I have tried k=2 and k=4. For k=2, it shows the same clustering every time (I ran it 5 times), but for k=4 it shows a different clustering every time. – Dark_Knight Oct 26 '16 at 07:35
-
@Ferdi I am not familiar with the k-means++ algorithm. How do I detect whether my data set has overlapping clusters? – Dark_Knight Oct 26 '16 at 07:36
-
Does k=2 make sense to you? Clustering is an exploratory analysis. Can you interpret the clusters? – L.V.Rao Oct 26 '16 at 07:38
-
@Dark_Knight The most straightforward way to detect overlapping clusters is by visualising the data. – Ferdi Oct 26 '16 at 07:39
-
@Ferdi is right. How many variables are used? – L.V.Rao Oct 26 '16 at 07:42
-
@L.V.Rao When I ran it a 6th time for k=2, it gave me a different clustering. I have demographic data for a large country, and I don't think k=2 is the appropriate number of clusters. In total, 7 variables are used for 28 different counties. – Dark_Knight Oct 26 '16 at 07:46
-
K-means can only be used if you already KNOW FOR SURE the number of clusters. If you don't know the number of clusters use hierarchical clustering. – Ferdi Oct 26 '16 at 07:48
-
@Ferdi OK, but what if I know the value of k for sure and still encounter the same problem? – Dark_Knight Oct 26 '16 at 07:50
-
@Dark_Knight: http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means One of the cases in those posts will apply to your situation. – Ferdi Oct 26 '16 at 07:56
-
@Ferdi I'll read it. It will take some time. :) – Dark_Knight Oct 26 '16 at 08:07
-
The differences in the variances of the variables may be affecting the solution. – L.V.Rao Oct 26 '16 at 08:09
-
@Dark_Knight All the best. – L.V.Rao Oct 26 '16 at 08:10
-
Unclear what you are asking. Here you are selecting clusters, there you are selecting a cluster. What are you selecting? – ttnphns Oct 26 '16 at 09:17
-
@ttnphns I am just selecting clusters. At this point I am not focusing on selecting the size of the clusters (which I know is important). I am just focusing on choosing the appropriate cluster for any relevant k. – Dark_Knight Oct 26 '16 at 10:17
3 Answers
You can use the clustering that minimizes the sum of variances within the clusters.
This is also used when determining the optimal $k$, in a tradeoff against $k$, since increasing $k$ will reduce the variance. But of course you can just as easily compare different clusterings with the same $k$: the $k$ term drops out, and you are essentially left with the within-cluster variance.
Alternatively, you can look at the silhouettes, which evaluate the separation of the clusters. This is also commonly used to determine $k$, but it can certainly be used to compare different clusterings with the same $k$.
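As a minimal sketch of the first suggestion, the following pure-Python code runs plain Lloyd's k-means from several random starts and keeps the run with the smallest within-cluster sum of squares (WCSS). The helper names (`sq_dist`, `kmeans`, `best_of_restarts`), the toy two-dimensional data, and the choice of 20 seeds are illustrative assumptions and are not taken from the question; many statistical packages provide a comparable multiple-restart option.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd's k-means from a random start; returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct data points chosen at random (hypothetical choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


def best_of_restarts(points, k, seeds):
    """Run k-means once per seed; return (run with lowest WCSS, all runs)."""
    runs = [kmeans(points, k, s) for s in seeds]
    return min(runs, key=lambda run: run[2]), runs


# Toy data (an assumption for illustration): three noisy 2-D groups of 10 points.
data_rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in centers
    for _ in range(10)
]

best, runs = best_of_restarts(points, k=3, seeds=range(20))
best_labels, best_centroids, best_wcss = best
print("WCSS per run:", sorted(round(r[2], 2) for r in runs))
print("lowest WCSS:", round(best_wcss, 2))
```

Because $k$ is the same in every run, the WCSS values are directly comparable and no penalty for the number of clusters is needed.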

If you get very different results every time, probably none of them is good.
If k-means works well, most seeds will yield the same result (apart from the numbering of the clusters).
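One way to check this, sketched here under stated assumptions, is to compare the label vectors from different seeds with a measure that ignores how the clusters are numbered. The code below uses the plain Rand index (the fraction of point pairs on which two clusterings agree); the function name and the three example label vectors are hypothetical. The plain Rand index is not corrected for chance, so the adjusted Rand index is often preferred in practice.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings.

    A pair counts as an agreement when both clusterings put the two points
    together, or both put them apart. Cluster numbering does not matter.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical label vectors from three runs on the same six points.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same partition as run_1, clusters renumbered
run_3 = [0, 1, 1, 2, 2, 0]  # a genuinely different partition

print(rand_index(run_1, run_2))  # identical partitions give 1.0
print(rand_index(run_1, run_3))  # lower value: the runs disagree
```

If most pairs of runs score close to 1, the solution is stable across seeds; widely varying scores suggest that none of the runs should be trusted much.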

I would try other k values to see whether the clustering results are significantly different. In addition, there are some centroid initialization algorithms that can eliminate the random factor, which would help you stabilize the results. For example, see this.
Also, you may want to use some internal indices to evaluate the clustering solutions.
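As one example of an internal index, the mean silhouette width can be computed directly from the points and their cluster labels. The sketch below is a pure-Python illustration that assumes Euclidean distance and at least two clusters; the function name and the four toy points are assumptions made for the example.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    Requires at least two clusters. Points in singleton clusters
    contribute 0, following the usual convention.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette taken as 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, hence the len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy example: two tight pairs of points that are far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
sensible = [0, 0, 1, 1]  # groups each tight pair together
poor = [0, 1, 0, 1]      # splits each tight pair across clusters

print(mean_silhouette(toy_points, sensible))
print(mean_silhouette(toy_points, poor))
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or badly assigned clusters, so competing solutions for the same k can be ranked by this score.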
-
The question is poorly worded. Start by saying what kind of data you have, if any. Then explain what clustering method you are looking at. It is okay to use the links after you explain the problem. – Michael R. Chernick Jan 12 '17 at 06:13