Appropriate cluster in k-mean clustering

Question

I am performing k-means clustering on a demographic data set. I have taken $k = 3$, and each time I run the clustering in my software I get a different set of clusters. I am not sure which result should be treated as final. I understand why each run produces different clusters, but how do I work out which clustering is the most appropriate one? Is this a subjective choice?

If your question also depends on software, why don't you include the information which software you are using in your post? — Ferdi, Oct 26 '16 at 07:16
No, my question doesn't depend on software. You can perform clustering (for a fixed k) any number of times in any software (matlab, mathematica) and you will get different results each time. — Dark_Knight, Oct 26 '16 at 07:27
The k-means solution depends on the initial selection of the centroids. Agree with @Ferdi. — L.V.Rao, Oct 26 '16 at 07:28
@L.V.Rao Exactly! That's why it gives different clusters every time I run the clustering. I suppose this is common to all statistical tools that perform k-means clustering. — Dark_Knight, Oct 26 '16 at 07:30
@Dark_Knight In this case the k-means++ algorithm might be helpful. — Ferdi, Oct 26 '16 at 07:30
Furthermore, k-means does not work properly for overlapping clusters. Maybe you have overlapping clusters. — Ferdi, Oct 26 '16 at 07:32
@L.V.Rao Yes, I have tried k=2 and k=4. For k=2, it shows the same clustering every time (I ran it 5 times), but for k=4 it shows a different clustering every time. — Dark_Knight, Oct 26 '16 at 07:35
@Ferdi I am not familiar with the k-means++ algorithm. How do I detect whether my data set has overlapping clusters? — Dark_Knight, Oct 26 '16 at 07:36
Does k=2 make sense for you? Clustering is an explorative analysis. Can you interpret the clusters? — L.V.Rao, Oct 26 '16 at 07:38
@Dark_Knight The most straightforward way to detect overlapping clusters is by visualising the data. — Ferdi, Oct 26 '16 at 07:39
@L.V.Rao When I ran it a 6th time for k=2, it gave me a different clustering. I have demographic data for a large country and I don't think k=2 is the appropriate number of clusters. In total, 7 variables are used for 28 different counties. — Dark_Knight, Oct 26 '16 at 07:46
K-means can only be used if you already KNOW FOR SURE the number of clusters. If you don't know the number of clusters use hierarchical clustering. — Ferdi, Oct 26 '16 at 07:48
@Ferdi OK, but what if I know the value of k for sure and still encounter the same problem? — Dark_Knight, Oct 26 '16 at 07:50
@Dark_Knight: http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means One of the cases in the posts will be true in your case — Ferdi, Oct 26 '16 at 07:56
The differences in the variances of the variables may be affecting the solution. — L.V.Rao, Oct 26 '16 at 08:09
Unclear what you are asking. Here you are selecting clusters, there you are selecting a cluster. What are you selecting? — ttnphns, Oct 26 '16 at 09:17
@ttnphns I am just selecting clusters. At this point I am not focusing on selecting the number of clusters (which I know is important). I am just focusing on choosing the appropriate clustering for any relevant k. — Dark_Knight, Oct 26 '16 at 10:17

score 2 · Answer 1 · edited May 23 '17 at 12:39

You can use the clustering that minimizes the sum of variances within the clusters, that is, the within-cluster sum of squares that k-means itself tries to minimize.

This is also used when determining the optimal $k$, in a tradeoff against $k$, since increasing $k$ will reduce the variance - but of course you can just as easily compare different clusterings with the same $k$. The $k$ term drops out, and you are essentially left with the within-cluster variance.
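As a rough illustration of comparing runs with the same $k$, here is a minimal pure-Python sketch that restarts Lloyd's algorithm from several random seeds and keeps the run with the lowest within-cluster sum of squares. The function names (`kmeans`, `best_of_restarts`) and the toy points are made up for this example and do not come from any particular package.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=100):
    """Lloyd's algorithm from random initial centroids.
    Returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an emptied cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, wcss

def best_of_restarts(points, k, seeds):
    """Run k-means once per seed and keep the run with the smallest WCSS."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda run: run[1])

# Toy data (hypothetical): three loose groups in two dimensions.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]
labels, wcss = best_of_restarts(points, k=3, seeds=range(10))
print(labels, round(wcss, 4))
```

Many packages can do this restarting for you (for example the `n_init` argument of scikit-learn's `KMeans`, or `nstart` in R's `kmeans`), so in practice you may only need to raise that setting.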

Alternatively, you can look at the silhouettes, which evaluate the separation of the clusters. This is also commonly used to determine $k$, but it can certainly be used to compare different clusterings with the same $k$.
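To make the silhouette comparison concrete, here is a small sketch that computes the mean silhouette width for two labelings of the same points with the same $k$. It assumes Euclidean distance and uses the common convention of scoring singleton clusters as 0; `mean_silhouette` and the toy data are hypothetical names for illustration only.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width over all points (needs at least two clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical) and two candidate clusterings with k = 3.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]
run_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
run_b = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # one point placed in a distant cluster

sil_a = mean_silhouette(points, run_a)
sil_b = mean_silhouette(points, run_b)
print(sil_a, sil_b)  # the run with the higher value is better separated
```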

score 2 · Answer 2 · answered Oct 26 '16 at 20:27

If you get very different results every time, probably none of them is good.

If k-means works well, most seeds will yield the same result (except for enumeration of clusters).
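One way to check this, sketched under the assumption that you have saved the label vector from each run, is to compare pairs of runs with the Rand index. It only asks whether each pair of points is grouped together or apart, so it ignores how the clusters happen to be numbered. The function name and the label vectors below are illustrative.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical label vectors from three runs on the same nine points.
run_1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
run_2 = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same partition, clusters renumbered
run_3 = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # one point assigned differently

same = rand_index(run_1, run_2)
diff = rand_index(run_1, run_3)
print(same, diff)
```

Values close to 1 across most pairs of seeds suggest a stable solution. The adjusted Rand index is a chance-corrected variant of the same idea.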

Has QUIT--Anony-Mousse


score 0 · Answer 3 · edited Apr 13 '17 at 12:44

I would try other values of $k$ to see whether the clustering results differ significantly. In addition, there are some centroid initialization algorithms that can eliminate the random factor, which would help you stabilize the results. For example, see this.
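As one illustration of an initialization with no random component, here is a farthest-first (maximin) seeding sketch. This is my own choice of example and not necessarily the method linked above; the function name and toy data are hypothetical. It starts from the data point closest to the overall mean and then repeatedly adds the point farthest from the centroids chosen so far.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centroids(points, k):
    """Deterministic seeding: the same data always gives the same centroids."""
    overall_mean = tuple(sum(c) / len(points) for c in zip(*points))
    # First centroid: the data point nearest the overall mean.
    centroids = [min(points, key=lambda p: sq_dist(p, overall_mean))]
    while len(centroids) < k:
        # Next centroid: the point whose nearest chosen centroid is farthest away.
        centroids.append(max(points,
                             key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

# Toy data (hypothetical): three loose groups in two dimensions.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]
init = farthest_first_centroids(points, k=3)
print(init)
```

Because nothing is drawn at random, repeated runs start from the same centroids and so end in the same clustering. The trade-off is that this kind of seeding tends to pick outliers, so it is worth checking the result with an internal index as well.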

Also, you may want to use some internal indices to evaluate the clustering solutions.

answered Jan 12 '17 at 05:59

foo


The question is poorly worded. Start by saying what kind of data you have, if any. Then explain what clustering method you are looking at. It is okay to use the links after you explain the problem. – Michael R. Chernick Jan 12 '17 at 06:13
