
I have a general question about what to do when no ground-truth data is available and clustering is carried out.

Are there still metrics that can indicate how well or poorly the clustering worked on the "baseline" data set? I am not sure how to tune the initial model without appropriate measures of (internal) model performance on the baseline data.

What is a best practice?

LaLaTi
  • We have many questions in the [tag:clustering] tag that may be helpful, e.g., [How to decide on the correct number of clusters?](https://stats.stackexchange.com/q/23472/1352) I don't see a better way than deciding on the correct number of clusters to evaluate a clustering, so I'll vote to close as a duplicate of that one. If you have something else in mind by "how good or bad the clustering worked", please consider clarifying. – Stephan Kolassa Sep 18 '19 at 04:44
  • I did not understand this: `to tune the initial model without appropriate measures for (internal) model performance on the baseline data`. – ttnphns Sep 18 '19 at 09:40
  • 1
    https://stats.stackexchange.com/q/195456/3277 – ttnphns Sep 18 '19 at 09:42
  • https://stats.stackexchange.com/q/21807/3277 – ttnphns Sep 18 '19 at 09:43

1 Answer


This article lists a number of internal cluster validity indices: https://www.ncbi.nlm.nih.gov/pubmed/26389570.

These internal indices measure the quality of your clusters without reference to ground-truth labels, and several of them are implemented in scikit-learn. Given your tag, I am going to assume that you are using Python.

Here is a link to the clustering performance metrics implemented in scikit-learn, along with examples: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
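
For instance, here is a minimal sketch of using three of those internal indices to compare candidate numbers of clusters. It assumes k-means as the clusterer and uses `make_blobs` as a stand-in for your baseline data; swap in your own feature matrix and clustering algorithm.

```python
# Tune the number of clusters with internal validity indices only --
# no ground-truth labels are needed.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,          # higher is better
    calinski_harabasz_score,   # higher is better
    davies_bouldin_score,      # lower is better
)

# Placeholder "baseline" data set; replace with your own feature matrix X.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(
        f"k={k}  "
        f"silhouette={silhouette_score(X, labels):.3f}  "
        f"calinski_harabasz={calinski_harabasz_score(X, labels):.1f}  "
        f"davies_bouldin={davies_bouldin_score(X, labels):.3f}"
    )
```

You would pick the setting where the indices agree on a good score (e.g., the highest silhouette), keeping in mind that each index has its own bias toward certain cluster shapes.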

Hope this answers your question.