14

When we do classification and regression, we usually split the data into training and testing sets to help us build and improve models.

However, when we do clustering, do we also need training and testing sets? Why?

rz.He
  • Yes - for similar reasons as classification/regression. You want to make sure that whatever model you create (say your elbow plot indicates that k=3 in a k-means clustering) is still appropriate to unseen data. – ilanman Mar 21 '17 at 17:50
  • Thank you ilanman ;) Also, do you have any recommendations about how to determine the actual number of clusters when we do clusterings such as kmeans? – rz.He Mar 21 '17 at 18:19

3 Answers

9

Yes, because clustering may also suffer from an over-fitting problem. For example, increasing the number of clusters will always "increase the performance".

Here is one demo using K-Means clustering:

The objective function of K-means is

$$ J=\sum_{j=1}^{k}\sum_{i=1}^{n_j}\|x_i^{(j)}-c_j\|^2 $$

where $x_i^{(j)}$ is the $i$-th point assigned to cluster $j$ and $c_j$ is the center of cluster $j$.

With this objective, a lower $J$ means a "better" model.

Suppose we have the following data (the iris data). Choosing $4$ clusters will always be "better" than choosing $3$, and choosing $5$ clusters will be better than $4$. We can continue along this track and end up with a cost of $J=0$: just make the number of clusters equal to the number of data points, and place each cluster center on the corresponding point.

# petal length and petal width from the iris data
d <- iris[, c(3, 4)]

# fit k-means with 4 and 3 clusters (nstart = 20 restarts to avoid poor local minima)
res4 <- kmeans(d, centers = 4, nstart = 20)
res3 <- kmeans(d, centers = 3, nstart = 20)

# plot both solutions side by side, with J = total within-cluster sum of squares
par(mfrow = c(1, 2))
plot(d, col = factor(res4$cluster),
     main = paste("4 clusters J=", round(res4$tot.withinss, 4)))
plot(d, col = factor(res3$cluster),
     main = paste("3 clusters J=", round(res3$tot.withinss, 4)))

[Figure: scatter plots of the two k-means solutions; the 4-cluster fit has a lower J than the 3-cluster fit.]

If we hold out data for testing, it prevents us from over-fitting. In the same example, suppose we choose a large number of clusters and place every cluster center on a training data point. The testing error will be large, because the testing data points will not coincide with the training data.
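As a minimal sketch of this hold-out idea (not part of the original answer; the function name `test_J` and the 70/30 split are assumptions for illustration), one can fit k-means on a training split and then score the held-out points by their squared distance to the nearest training center:

```r
# Hypothetical sketch: evaluate k-means on held-out data.
set.seed(1)
d <- iris[, c(3, 4)]
idx <- sample(nrow(d), 0.7 * nrow(d))   # 70/30 train/test split (an assumption)
train <- d[idx, ]
test  <- d[-idx, ]

test_J <- function(k) {
  fit <- kmeans(train, centers = k, nstart = 20)
  # squared Euclidean distance from each test point to each training center
  d2 <- as.matrix(dist(rbind(fit$centers, as.matrix(test))))^2
  d2 <- d2[-(1:k), 1:k, drop = FALSE]
  sum(apply(d2, 1, min))  # assign each test point to its nearest center
}

sapply(2:6, test_J)  # unlike training J, test J need not keep shrinking with k
```

Unlike the training objective, this held-out score cannot be driven to zero just by adding clusters, which is exactly the point of the answer.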

Haitao Du
  • Hi hxd1011, thanks for your quick reply. Another question, do you have any recommendations about how to determine the actual number of clusters when we do clusterings such as kmeans? – rz.He Mar 21 '17 at 18:19
  • @rz.He yes, check this answer http://stats.stackexchange.com/questions/261537/how-to-chose-the-order-for-polynomial-regression/261544#261544 – Haitao Du Mar 21 '17 at 18:23
  • 2
    +1 because it is a constructive answer but to play devil's advocate you do know they are 3 clusters. If someone showed this data without any context a 2-cluster solution would work beautifully too. Maybe you even have some of the upper right-most points as outliers to play "real-data-have-outliers" too. It would be much more constructive (and stringent) to look at the coherence between bootstrapped/jittered/subsetted clustering runs using some statistic (eg. cophenetic correlation, Adjusted Rand-Index, etc.). – usεr11852 Mar 21 '17 at 21:28
  • And if you don't use k-means? Say, average-linkage clustering? I fear that **your answer is overfitting to k-means**. – Has QUIT--Anony-Mousse Mar 22 '17 at 07:10
  • @Anony-Mousse: The answer is particular to k-means as an example, but it would be qualitatively the same if DBSCAN or spectral clustering or whatever else was used. It just shows that a particular metric can be over-fitted. – usεr11852 Mar 22 '17 at 11:02
  • Except that DBSCAN doesn't optimize a metric? – Has QUIT--Anony-Mousse Mar 23 '17 at 01:53
7

No, this will usually not be possible.

There are very few clustering methods that you could use like a classifier. Only with k-means, PAM, etc. could you evaluate "generalization", and clustering has become much more diverse (and interesting) since. In fact, even the old hierarchical clustering won't generalize well to 'new' data. Clustering isn't classification, and many methods from classification do not transfer well to clustering, including hyperparameter optimization.

If you have only partially labeled data, you can use those labels to optimize parameters. But the general scenario for clustering is that you want to learn more about your data set: you run clustering several times, investigate the interesting clusters (because usually some clusters are clearly too small or too large to be interesting!) and note down the insights you gained. Clustering is a tool to help a human explore a data set, not an automatic thing. You will not "deploy" a clustering: clusterings are too unreliable, and a single clustering will never "tell the whole story".
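A sketch of the contrast this answer draws (the point `new_point` is hypothetical, chosen only for illustration): k-means yields centers that a new point can be assigned to, while hierarchical clustering yields only a dendrogram over the training points, with no built-in way to place new data:

```r
d <- iris[, c(3, 4)]

# k-means produces centers, so a new point can be assigned to the nearest one
fit <- kmeans(d, centers = 3, nstart = 20)
new_point <- c(4.5, 1.5)  # hypothetical new observation
assigned <- which.min(colSums((t(fit$centers) - new_point)^2))

# hierarchical (average-linkage) clustering gives only a dendrogram over the
# training points; cutree() labels them, but there is no predict() for new data
hc <- hclust(dist(d), method = "average")
labels <- cutree(hc, k = 3)
```

This is why "generalization" can be evaluated for center-based methods but not, in any direct way, for dendrogram-based ones.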

Has QUIT--Anony-Mousse
  • 1
    Clustering reflects a global property of the data and it commonly has no "ground-truth". Having sad that, I do not think anyone advocates using a clustering as a classifier at first instance; nevertheless if we find an interesting clustering it will be foolish not to try to use the findings by incorporating them in a decision-making process. (Otherwise why did we cluster the data to begin with?) – usεr11852 Mar 22 '17 at 11:01
  • To run clustering, we still need an objective to optimize. If it is an optimization problem, then it can over-fit on a data set. In addition to k-means, many other methods still need the number of clusters. – Haitao Du Mar 22 '17 at 13:22
  • 1
    Not every clustering algorithm is an optimization problem. – Has QUIT--Anony-Mousse Mar 23 '17 at 01:54
  • 1
    And as far as using the result: you want to use the insights, not the raw result. Interpret the cluster, and work with the *interpretation*, because there will be plenty of badly assigned points. – Has QUIT--Anony-Mousse Mar 23 '17 at 01:57
  • I support this answer, because when a new data point arrives, you learn the representation and then cluster, so there is no need for a test set. And if you split the data, you lose information. – Aaditya Ura Jan 21 '20 at 13:10
0

No. You do not use training and testing sets in unsupervised learning. There is no objective function in unsupervised learning with which to test the performance of the algorithm.

S_Dhungel
  • Without some more detail this is not really adding to the discussion and the two existing answers. Can you expand on it? – mdewey Jul 19 '17 at 17:27