
Some clustering algorithms, e.g. K-Means, are very sensitive to outliers, so we need to remove the outliers before applying K-Means, or the clustering will be poor. So:

  1. How can we tell which points are outliers if we cannot plot the data (high-dimensional data)?
  2. How can we tell whether a K-Means model is good or bad? Since it is unsupervised learning, we cannot compute an accuracy rate (something like an F1 score, etc.). Do we have any method to tell whether an unsupervised learning model is good or bad?
voxter

1 Answer


The answer is to check the loss (the sum of distances to the cluster centers, in the K-means setting) on a held-out test data set.

In unsupervised learning we also need both a training and a testing data set, because over-fitting is common in the unsupervised setting as well. For example, in K-means, if we increase the number of clusters $k$, the clustering performance (the loss function / sum of distances) will always improve on the training data and will eventually over-fit it. In the extreme case, making every training point its own cluster center drives the training loss to $0$.
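To see that monotone decrease concretely, here is a rough sketch (assuming Python with scikit-learn, which the answer does not prescribe; the synthetic data and the values of $k$ are illustrative only):

```python
# A rough sketch, assuming Python with scikit-learn (not specified in the
# answer); the synthetic data and the values of k are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 "true" clusters.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# The training loss (inertia = within-cluster sum of squared distances)
# keeps dropping as k grows, regardless of the true structure.
for k in [1, 2, 3, 5, 10, 30, 60]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k:3d}  training loss (inertia) = {km.inertia_:.2f}")

# With k == n_samples, every point becomes its own center and the loss is 0.
```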

To evaluate how good the clustering is, one valid approach is to test whether the fitted model (the number of clusters and their centers) still performs well on held-out test data.
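A rough sketch of what that held-out check could look like (again assuming Python with scikit-learn; the split, the data set, and the candidate values of $k$ are purely illustrative assumptions):

```python
# A rough sketch of the held-out check, again assuming Python with
# scikit-learn; the split and the values of k are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

for k in [2, 3, 5, 20, 100]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    # Distance from each test point to its nearest *learned* center;
    # summing the squares gives the test-set analogue of the training inertia.
    test_loss = (km.transform(X_test).min(axis=1) ** 2).sum()
    print(f"k={k:3d}  train loss = {km.inertia_:9.1f}  test loss = {test_loss:9.1f}")
```

Comparing the train and test losses across values of $k$ then shows how much of the apparent improvement on the training data actually carries over to unseen data.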


I was talking about unsupervised learning in general (including probabilistic models, mixtures of Gaussians, etc.); for K-means specifically, you can find an answer here (if it is not a duplicate):

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

Haitao Du
  • Why won't increasing k lower the loss value on a test set? – eric_kernfeld Sep 27 '17 at 13:53
  • 1
    @eric_kernfeld think about the extreme example, when we put a cluster center in every data point in training. The testing data should have different data points. In such a case, training loss is 0, but testing is definitely not, unless two data sets identical. – Haitao Du Sep 27 '17 at 13:58
  • 2
    What you say is true, but not exactly what I am getting at. Suppose you have training and test sets of size 200 and you increase the number of centers from 8 to 16. Don't you think the test set loss will decrease? – eric_kernfeld Sep 27 '17 at 14:10
  • Also, how do you use this tactic to compare methods? Two methods may have completely different loss functions, or the loss functions may be arbitrarily scaled. – eric_kernfeld Sep 27 '17 at 14:11
  • @eric_kernfeld Comparing different models is another question, and not what the OP originally asked. – Haitao Du Sep 27 '17 at 14:55
  • Given the error on the train set and the test set, how can I know whether the model is good or bad? Do we compare the error on the train set and the test set with the same number of clusters, or do we compare them across different numbers of clusters? – voxter Sep 28 '17 at 01:35