
My goal is to cluster my data: 20,000 samples with 14 dimensions/features, each feature ranging from 0.0 to 1.0. Since I don't know the number of clusters, I tried MeanShift and DBSCAN.

My problem with both algorithms is that, depending on the parameters, they either find one large cluster containing 90%+ of the data plus a few very small clusters, or hundreds of tiny clusters.

I am trying to find large groups of similar samples: a small number of clusters (roughly 3-20) with about 100-5000 samples per cluster, and I would like to try algorithms (or parameters) that could find them. I am also aware that there might not be any meaningful clusters at all.
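In case it helps, this is roughly my setup (a minimal sketch assuming scikit-learn; `X` stands in for the real 20000 × 14 array and the parameter values shown are only placeholders, not the exact ones I tried):

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift, estimate_bandwidth

# X stands in for the real (20000, 14) array, already scaled to [0, 1]
X = np.random.rand(20000, 14)

# MeanShift: bandwidth estimated from a subsample of the data
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=2000)
ms_labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X)

# DBSCAN: eps and min_samples are the knobs that flip the outcome between
# "one huge cluster" and "hundreds of tiny clusters"
db_labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(X)

# quick look at the cluster size distribution (DBSCAN noise points have label -1)
for name, labels in (("MeanShift", ms_labels), ("DBSCAN", db_labels)):
    sizes = np.bincount(labels[labels >= 0])
    print(name, "-", len(sizes), "clusters, largest:", sizes.max() if sizes.size else 0)
```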

Here are scatterplots of my data (one per dimension); maybe this helps with suggesting which algorithms (or parameters) I should try.

[Image: scatterplots of the data, one per dimension]

Flitschi
  • Welcome to CV.SE. 1. What is the similarity metric used and is it relevant for this data? 2. Why do we expect clusters? Are they known to exist based on prior knowledge? 3. Have you tried K-means with some visualisation of the Gap statistic across K? 4. "*one large cluster with about 90%+ of the data and a few very small clusters*" might be fine if looking at a largely homogeneous population with some "outlierish" subgroup. 5. Use dimensionality reduction (e.g. PCA) with few components and see how this looks too (a sketch of points 3 and 5 is given after these comments). – usεr11852 Dec 29 '21 at 12:00
  • What is the Gap statistic for K-means? Do you mean the silhouette scores? – Thomas Dec 29 '21 at 12:25
  • Clustering is hard in high dimensions. [This recent thread and references therein may be interesting reading.](https://stats.stackexchange.com/q/558567/1352) – Stephan Kolassa Dec 29 '21 at 13:03
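
A minimal sketch of what the first comment suggests (points 3 and 5): project onto a few principal components with PCA, then scan K-means over a range of K. The silhouette score is used here as a stand-in for the Gap statistic, which scikit-learn does not ship; `X` is again only a placeholder for the actual 20000 × 14 array:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X stands in for the real (20000, 14) array
X = np.random.rand(20000, 14)

# Project onto a few principal components before clustering
X_red = PCA(n_components=3).fit_transform(X)

# Scan a small range of K; silhouette is used as a stand-in for the Gap statistic
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_red)
    score = silhouette_score(X_red, labels, sample_size=5000, random_state=0)
    print(f"k={k}: silhouette={score:.3f}, cluster sizes={np.bincount(labels)}")
```

Printing the cluster sizes alongside the score makes it easy to see whether a given K still produces one dominant cluster or the 100-5000-sample groups the question is after.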

0 Answers