I am trying to perform topic modeling on text data, i.e. cluster text messages by topic. My approach is to use a BERT model to get sentence embeddings, then use t-SNE to reduce the embeddings to an 8-dimensional space. Finally, I use HDBSCAN to cluster the dimensionality-reduced embeddings.
When I do so, about 40% of the data points in the training set are labelled -1 (noise). When predicting on new data, 60% of the points get labelled -1. This is a really high fraction, because I know most of the data should belong to a topic, and I am also setting the HDBSCAN parameter min_samples = 1.
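For reference, a minimal sketch of how these noise fractions can be measured from the cluster labels. The function name `noise_fraction` and the label lists are made up for illustration; the only thing taken from HDBSCAN is the convention that noise points get the label -1.

```python
from collections import Counter


def noise_fraction(labels):
    """Return the share of points labelled -1 (HDBSCAN's noise label)."""
    labels = list(labels)
    if not labels:
        return 0.0
    return Counter(labels)[-1] / len(labels)


# Made-up labels for illustration only, not my real data.
train_labels = [0, 0, 1, -1, 2, -1, 1, 0, -1, -1]
new_labels = [-1, 0, -1, -1, 1, -1, -1, 2, -1, 0]

print(f"train noise: {noise_fraction(train_labels):.0%}")
print(f"new-data noise: {noise_fraction(new_labels):.0%}")
```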
I have seen other people face this issue with HDBSCAN as well. Does this mean that I am doing something wrong and need to adjust my data further, or is such a result typical for HDBSCAN? If the latter, should I resort to soft clustering with HDBSCAN, or try a different clustering method altogether?
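To make the soft-clustering option concrete, here is a rough sketch of what I have in mind: take a per-point membership matrix (for example, the kind the `hdbscan` package returns from `all_points_membership_vectors` when the clusterer is fitted with `prediction_data=True`) and give each noise point its most likely cluster. The helper `reassign_noise`, the `min_prob` cutoff, and the numbers are hypothetical, and I am assuming that column `k` of the matrix corresponds to cluster label `k`.

```python
def reassign_noise(labels, memberships, min_prob=0.0):
    """Give each noise point (-1) the cluster with the highest membership score.

    labels: hard cluster labels, -1 for noise.
    memberships: one row per point, one score per cluster
        (assumption: column k corresponds to cluster label k).
    min_prob: hypothetical cutoff; a noise point whose best score is
        below it stays labelled -1.
    """
    result = []
    for label, row in zip(labels, memberships):
        if label != -1 or len(row) == 0:
            # Already clustered, or no clusters to choose from.
            result.append(label)
            continue
        best = max(range(len(row)), key=lambda k: row[k])
        result.append(best if row[best] >= min_prob else -1)
    return result


# Made-up values for illustration only.
hard_labels = [0, -1, 1, -1]
membership = [
    [0.90, 0.05],
    [0.30, 0.60],
    [0.10, 0.80],
    [0.02, 0.03],
]
soft_labels = reassign_noise(hard_labels, membership, min_prob=0.1)
print(soft_labels)
```

With a cutoff of 0 every noise point is forced into some cluster, so the cutoff is the knob that trades fewer -1 labels against more doubtful assignments.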