I am trying to perform topic modeling on text data, i.e. cluster text messages by topic. My approach is to use a BERT model to get sentence embeddings, then use t-SNE to reduce the embeddings to an 8-dimensional space, and finally cluster the dimensionally-reduced embeddings with HDBSCAN.

When I do so, about 40% of the data points in the training set are labelled as -1 (noise). When predicting on new data, 60% of the points are labelled as -1. This is a really high fraction, because I know most of the data should belong to a topic, and I am already setting the HDBSCAN parameter min_samples = 1.

I have seen other people face the same issue with HDBSCAN. Does this mean that I am doing something wrong and need to make more adjustments to my data, or is such a result typical for HDBSCAN? If the latter, should I resort to soft clustering with HDBSCAN, or try another clustering method entirely?

Fiori
  • When t-SNE reduces the data to two or three dimensions, does it diminish the semantic signal from the text too much? The two or three dimensions used by t-SNE are good for graphing, but since you know which phrases are similar (by reading them), maybe graphing is not the prime concern? Soft clustering seems a good idea, as does running HDBSCAN directly, that is, without the intermediate t-SNE step. – Single Malt Nov 01 '21 at 19:29
  • “In conclusions, use t-SNE for visualization (and try different parameters to get something visually pleasing!), but rather do not run clustering afterwards, in particular do not use distance- or density based algorithms, as this information was intentionally (!) lost.” https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne – Single Malt Nov 01 '21 at 20:58
  • I have the same issue. – Jinhua Wang Nov 14 '21 at 15:15
  • Thank you @SingleMalt. The issue is that the original embeddings are 384-dimensional, so clustering will suffer in such a high-dimensional space. As a result, I need to reduce the dimensionality before clustering. Do you know of any better approach than t-SNE? – Fiori Nov 16 '21 at 16:24
  • I am unfamiliar with how BERT works. I would try copying whatever has been done in the literature for BERT or similar models. Soft HDBSCAN could be used as a baseline. With HDBSCAN, can you reduce minPts and increase eps to reduce the number of points identified as noise? – Single Malt Nov 16 '21 at 20:42
  • [BERTopic](https://github.com/MaartenGr/BERTopic) implements this approach using UMAP as the reduction method. However, I still have the same issue, with half of the data labelled as -1. – Shohreh Dec 15 '21 at 08:43

1 Answer

There are two parameters to set in DBSCAN: minPts, and the distance eps used when searching for neighbors.

https://en.wikipedia.org/wiki/DBSCAN

Making eps larger and minPts smaller will solve your problem (most of the data not being in clusters). You set minPts to 1, but you did not set eps.

Try a larger value for eps.
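To illustrate the effect, here is a minimal sketch using scikit-learn's `DBSCAN`, where minPts is called `min_samples`. The data is an assumption: synthetic 8-dimensional blobs from `make_blobs` standing in for the reduced embeddings, and `noise_fraction` is a hypothetical helper written for this example. It only shows how the share of points labelled -1 changes with eps; it is not tuned for real sentence embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-in for the 8-dimensional reduced embeddings.
X, _ = make_blobs(n_samples=600, centers=3, n_features=8,
                  cluster_std=1.0, random_state=0)


def noise_fraction(X, eps, min_samples):
    """Hypothetical helper: share of points that DBSCAN labels as -1 (noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return float(np.mean(labels == -1))


# Sweep eps with min_samples (minPts) held fixed.
eps_values = [0.5, 1.0, 2.0, 4.0]
fractions = [noise_fraction(X, eps, min_samples=5) for eps in eps_values]

for eps, frac in zip(eps_values, fractions):
    print(f"eps={eps:.1f}  noise fraction={frac:.2f}")
```

For a fixed `min_samples`, the noise fraction cannot increase as eps grows, so a sweep like this is a quick way to see which range of eps is sensible for your data. Note that in scikit-learn `min_samples` counts the point itself, so `min_samples=1` makes every point a core point and leaves no noise at all, at the cost of many single-point clusters when eps is small.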

Haitao Du