
Some clustering algorithms, e.g. K-Means, are very sensitive to outliers, so we need to remove the outliers before applying K-Means, or the clustering will be poor. So:

  1. How can we tell which points are outliers if we cannot plot the data (high-dimensional data)?
  2. How can we tell whether a K-Means model is good or bad? Since it is unsupervised learning, we cannot compute an accuracy rate (something like an F1 score, etc.). Do we have any method to tell whether an unsupervised learning model is good or bad?
voxter

1 Answer


The answer is to check the loss (the sum of distances to the cluster centers, in the K-means setting) on a held-out test data set.

In unsupervised learning we also need both a training and a testing data set, because over-fitting is common in the unsupervised setting as well. For example, in K-means, if we increase the number of clusters $k$, the clustering performance (the loss function / sum of distances) will always improve on the training data and will eventually over-fit it. In the extreme case, making every training point its own cluster center drives the training loss to $0$.
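To see that monotone decrease concretely, here is a rough sketch (assuming Python with scikit-learn, which the answer does not prescribe; the synthetic data and the values of $k$ are illustrative only):

```python
# A rough sketch, assuming Python with scikit-learn (not specified in the
# answer); the synthetic data and the values of k are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 "true" clusters.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# The training loss (inertia = within-cluster sum of squared distances)
# keeps dropping as k grows, regardless of the true structure.
for k in [1, 2, 3, 5, 10, 30, 60]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k:3d}  training loss (inertia) = {km.inertia_:.2f}")

# With k == n_samples, every point becomes its own center and the loss is 0.
```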

To evaluate how good the clustering is, one valid approach is to test whether the fitted model (the number of clusters and their centers) still performs well on held-out test data.
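A rough sketch of what that held-out check could look like (again assuming Python with scikit-learn; the split, the data set, and the candidate values of $k$ are purely illustrative assumptions):

```python
# A rough sketch of the held-out check, again assuming Python with
# scikit-learn; the split and the values of k are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

for k in [2, 3, 5, 20, 100]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    # Distance from each test point to its nearest *learned* center;
    # summing the squares gives the test-set analogue of the training inertia.
    test_loss = (km.transform(X_test).min(axis=1) ** 2).sum()
    print(f"k={k:3d}  train loss = {km.inertia_:9.1f}  test loss = {test_loss:9.1f}")
```

Comparing the train and test losses across values of $k$ then shows how much of the apparent improvement on the training data actually carries over to unseen data.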


I was talking about unsupervised learning in general (including probabilistic models, mixtures of Gaussians, etc.); for K-means specifically, you can find an answer here (if it is not a duplicate):

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

Haitao Du
  • Why won't increasing k lower the loss value on a test set? – eric_kernfeld Sep 27 '17 at 13:53
  • 1
    @eric_kernfeld think about the extreme example, when we put a cluster center in every data point in training. The testing data should have different data points. In such a case, training loss is 0, but testing is definitely not, unless two data sets identical. – Haitao Du Sep 27 '17 at 13:58
  • 2
    What you say is true, but not exactly what I am getting at. Suppose you have training and test sets of size 200 and you increase the number of centers from 8 to 16. Don't you think the test set loss will decrease? – eric_kernfeld Sep 27 '17 at 14:10
  • Also, how do you use this tactic to compare methods? Two methods may have completely different loss functions, or the loss functions may be arbitrarily scaled. – eric_kernfeld Sep 27 '17 at 14:11
  • @eric_kernfeld Comparing different models is another question, and not what the OP originally asked. – Haitao Du Sep 27 '17 at 14:55
  • Given the error on the train set and the test set, how can I know whether the model is good or bad? Do we compare the error on the train set and the test set with the same number of clusters, or do we compare them across different numbers of clusters? – voxter Sep 28 '17 at 01:35