How to evaluate the accuracy of clustering text data

Question

I have text data from customer inquiries and want to figure out the main topics customers inquire about.

I am approaching this by first using a pre-trained BERT SentenceTransformer model ('paraphrase-MiniLM-L6-v2') to embed the sentences, and then using HDBSCAN to cluster the embeddings.

My question is: what metric should I use to evaluate the accuracy of the clusters? By accuracy I mean having inquiries on the same topic fall in the same cluster. This also matters so that I know whether I am making improvements while fine-tuning the pretrained model or changing the clustering parameters.

My attempt so far: one idea I have had for measuring the accuracy is to compute what percentage of our replies to customer inquiries get assigned to the same cluster as their corresponding inquiry. The hypothesis here is that an inquiry and its reply belong to the same topic, so they should be assigned to the same cluster. Does this sound reasonable?
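For concreteness, the inquiry/reply idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `reply_agreement` and the example labels are hypothetical, and it assumes the cluster labels for inquiries and replies are already available as two aligned lists, with -1 marking points that HDBSCAN left unassigned (noise). Pairs that involve a noise point are counted as not matching.

```python
from typing import Sequence, Tuple

NOISE = -1  # HDBSCAN assigns -1 to points it does not place in any cluster


def reply_agreement(
    inquiry_labels: Sequence[int],
    reply_labels: Sequence[int],
    noise_label: int = NOISE,
) -> Tuple[float, float]:
    """Return (agreement, noise_fraction) over aligned inquiry/reply pairs.

    agreement      : share of pairs where inquiry and reply are in the same
                     non-noise cluster.
    noise_fraction : share of pairs where at least one side is noise.
    """
    if len(inquiry_labels) != len(reply_labels):
        raise ValueError("inquiry_labels and reply_labels must be aligned")
    total = len(inquiry_labels)
    if total == 0:
        raise ValueError("need at least one inquiry/reply pair")

    matched = 0
    noisy = 0
    for q, r in zip(inquiry_labels, reply_labels):
        if q == noise_label or r == noise_label:
            noisy += 1  # a noise point never counts as a match
        elif q == r:
            matched += 1
    return matched / total, noisy / total


# Hypothetical labels for six inquiry/reply pairs.
inquiries = [0, 0, 1, 2, -1, 1]
replies = [0, 1, 1, 2, -1, -1]
agreement, noise_fraction = reply_agreement(inquiries, replies)
print(f"agreement={agreement:.2f}, noise={noise_fraction:.2f}")
```

One caveat with this score taken alone: putting everything into a single cluster gives 100% agreement, so it is probably worth tracking the number of clusters and the noise fraction next to it.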

Are you going to compare a cluster solution with a "ground truth" or "reference" partition? — ttnphns, Oct 23 '21 at 03:05
No @ttnphns, I don't have a ground truth (true labels) to compare it to. This is totally unsupervised, with no labelled data. — Fiori, Oct 25 '21 at 15:12
Check https://stats.stackexchange.com/q/195456/3277, then https://stats.stackexchange.com/a/358937/3277 — ttnphns, Oct 26 '21 at 08:08

score 0 · Answer 1 · answered Oct 25 '21 at 08:59

The standard evaluation of a clustering, given some reference groups, is the so-called V-measure, inspired by the F-measure that is often used to evaluate classification problems.

The V-measure is the harmonic mean of homogeneity and completeness.

Intuitively, homogeneity says how "pure" the clusters are, i.e., it is one if each cluster contains only data points from the same topic. This is trivially satisfied if each data point is a cluster on its own, a degenerate solution that we want to avoid.
Completeness says to what extent each topic is kept together rather than broken up across clusters, i.e., it is one if all data points from the same topic end up in the same cluster. This also has a trivial solution: create only one cluster for the entire dataset, and every topic will be complete in that one cluster.

Because people typically want to optimize both of these criteria at the same time, they use the harmonic mean of the two, which is called the V-measure.
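To make the two degenerate solutions concrete, here is a small pure-Python sketch of the usual entropy-based definitions: homogeneity = 1 - H(topics | clusters) / H(topics), completeness = 1 - H(clusters | topics) / H(clusters), and V as their unweighted harmonic mean. The helper names and the toy labels are made up for illustration; in practice a library implementation would be preferable.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy H(labels), in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given), in nats."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (n_tg / n) * log(n_tg / given_counts[g])
        for (_, g), n_tg in joint.items()
    )


def v_measure(topics, clusters):
    """Return (homogeneity, completeness, v) for reference topics vs. clusters."""
    if len(topics) != len(clusters) or len(topics) == 0:
        raise ValueError("topics and clusters must be non-empty and aligned")

    h_topics = _entropy(topics)
    h_clusters = _entropy(clusters)

    # By convention a score is 1 when the corresponding entropy is zero.
    homogeneity = (
        1.0 if h_topics == 0
        else 1.0 - _conditional_entropy(topics, clusters) / h_topics
    )
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - _conditional_entropy(clusters, topics) / h_clusters
    )

    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Toy reference topics (hypothetical).
topics = ["billing", "billing", "shipping", "shipping"]

print(v_measure(topics, [0, 1, 2, 3]))  # every point is its own cluster
print(v_measure(topics, [0, 0, 0, 0]))  # one cluster for everything
print(v_measure(topics, [1, 1, 0, 0]))  # clusters match the topics
```

With singleton clusters homogeneity is perfect but completeness drops, with a single cluster it is the other way round, and only the matching clustering scores well on both, which is what the harmonic mean rewards.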

The V-measure is implemented in scikit-learn as sklearn.metrics.v_measure_score.

Hi @Jindřich, thank you for the reply, this is very helpful to know. Unfortunately, in this case I don't have any ground truth labels to work with. — Fiori, Oct 25 '21 at 15:19
