When we do classification and regression, we usually set aside training and testing sets to help us build and improve models.
However, when we do clustering, do we also need training and testing sets? Why?
Yes, because clustering may also suffer from the over-fitting problem. For example, increasing the number of clusters will always "increase the performance".
Here is a demo using K-means clustering:
The objective function of K-means is
$$ J=\sum_{j=1}^{k}\sum_{x_i \in C_j}\|x_i-c_j\|^2 $$
where $C_j$ is the set of points assigned to cluster $j$ and $c_j$ is that cluster's center. With such an objective, a lower $J$ means a "better" model.
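As a sanity check, this $J$ is exactly what R's kmeans() reports as tot.withinss. A minimal sketch, using the same iris columns as the demo below:

d <- iris[, c(3, 4)]
fit <- kmeans(d, centers = 3, nstart = 20)
# recompute J by hand: squared distance from each point to its assigned center
J <- sum((d - fit$centers[fit$cluster, ])^2)
all.equal(J, fit$tot.withinss)  # TRUE, up to floating-point error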
Suppose we have the following data (the iris data). Choosing $4$ clusters will always be "better" than choosing $3$ (in the sense of lower $J$), choosing $5$ will be better than $4$, and so on. We can continue down this track and end up with $J=0$: just make the number of clusters equal to the number of data points and place each cluster center on the corresponding point.
# cluster the iris petal measurements with k = 4 and k = 3
d <- iris[, c(3, 4)]  # Petal.Length, Petal.Width
res4 <- kmeans(d, centers = 4, nstart = 20)
res3 <- kmeans(d, centers = 3, nstart = 20)

# plot both solutions side by side, with J (tot.withinss) in the title
par(mfrow = c(1, 2))
plot(d, col = factor(res4$cluster),
     main = paste("4 clusters J =", round(res4$tot.withinss, 4)))
plot(d, col = factor(res3$cluster),
     main = paste("3 clusters J =", round(res3$tot.withinss, 4)))
If we hold out data for testing, it will prevent us from over-fitting. In the same example, suppose we choose a large number of clusters and put every cluster center on a training data point. The testing error will be large, because the testing data points will not overlap with the training data.
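A rough sketch of that hold-out idea (the 70/30 split and the nearest-center test score are my own choices here, not a built-in kmeans helper):

set.seed(1)
d <- iris[, c(3, 4)]
idx <- sample(nrow(d), 0.7 * nrow(d))
train <- d[idx, ]
test  <- d[-idx, ]

# total squared distance from each row of X to its nearest center
nearest_sse <- function(centers, X) {
  X <- as.matrix(X)
  sum(apply(X, 1, function(x) min(colSums((t(centers) - x)^2))))
}

for (k in c(2, 3, 5, 10, 25)) {
  fit <- kmeans(train, centers = k, nstart = 20)
  cat("k =", k, "train J =", fit$tot.withinss,
      "test SSE =", nearest_sse(fit$centers, test), "\n")
}

The training $J$ can be driven to $0$ by adding centers, but the test SSE cannot: it bottoms out around the distance from the test points to their nearest training points.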
No, this will usually not be possible.
There are very few clustering methods that you could use like a classifier. Only with k-means, PAM, and the like could you evaluate "generalization" to held-out data, and clustering has become much more diverse (and interesting) since those were introduced. In fact, even old-fashioned hierarchical clustering won't generalize well to 'new' data. Clustering isn't classification, and many techniques from classification, including hyperparameter optimization, do not transfer well to it.
If you have only partially labeled data, you can use those labels to optimize parameters. But the general scenario of clustering is that you want to learn more about your data set: you run clustering several times, investigate the interesting clusters (usually some clusters are clearly too small or too large to be interesting!), and note down the insights you get. Clustering is a tool to help a human explore a data set, not an automatic thing. You will not "deploy" a clustering: clusterings are too unreliable, and a single clustering will never "tell the whole story".
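A sketch of the partially-labeled idea, using the iris species labels on a random subset and the adjusted Rand index from the mclust package (that package choice is mine; any label-agreement measure would do):

library(mclust)  # for adjustedRandIndex()

d <- iris[, c(3, 4)]
set.seed(1)
labeled <- sample(nrow(d), 30)  # pretend only these 30 points have labels

ari <- sapply(2:6, function(k) {
  fit <- kmeans(d, centers = k, nstart = 20)
  adjustedRandIndex(fit$cluster[labeled], iris$Species[labeled])
})
names(ari) <- 2:6
ari  # pick the k whose clusters agree best with the known labels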
No. You do not use training and testing in unsupervised learning. There is no objective function in unsupervised learning to test the performance of the algorithm.