1

I have a dataset with two variables (x, y). When I plot the data in 2D, some points form a tight cluster (the green points), while the rest are scattered randomly (the red points).

[Image: scatter plot of the (x, y) data, with green and red points]

I am interested in running a clustering algorithm to identify the smaller cluster (i.e., green points) buried inside of other objects or other clusters.

Which clustering algorithm would typically work best in this case? kNN?

Thank you!

Harry UNL

3 Answers

1

The "cluster" you have manually identified is pretty arbitrary. For instance, the green points on the left don't seem to be any denser than the red points around them.

However, when it comes to finding clusters "nested" within others, you could:

  • Try to remap your data using some kind of kernel, and apply clustering in the remapped space. This may "distort" the feature space in a way that makes the clusters easier to separate. However, finding the right kernel is going to be laborious.
  • Use a mixture model, especially Gaussian mixture clustering. In your case, it might manage to describe the data as one "spread out" normal distribution mixed with a more concentrated distribution in the middle.
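To make the mixture idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic data, the function name `fit_two_gaussians`, the restriction to two isotropic (circular) components, and the initialisation. It is not tuned for your data, and a library implementation of Gaussian mixtures would be more robust in practice.

```python
import math
import random

def fit_two_gaussians(points, iters=200):
    """EM for a mixture of two isotropic 2-D Gaussians.

    Returns (weights, means, variances, responsibilities).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    spread = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points) / (2 * n)
    # Assumed initialisation: both components start at the data mean,
    # one as wide as the data and one ten times narrower.
    means = [(mx, my), (mx, my)]
    variances = [spread, spread / 10]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x, y in points:
            dens = []
            for k in range(2):
                d2 = (x - means[k][0]) ** 2 + (y - means[k][1]) ** 2
                norm = 2 * math.pi * variances[k]
                dens.append(weights[k] * math.exp(-d2 / (2 * variances[k])) / norm)
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append((dens[0] / total, dens[1] / total))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mux = sum(r[k] * x for r, (x, _) in zip(resp, points)) / nk
            muy = sum(r[k] * y for r, (_, y) in zip(resp, points)) / nk
            var = sum(r[k] * ((x - mux) ** 2 + (y - muy) ** 2)
                      for r, (x, y) in zip(resp, points)) / (2 * nk)
            means[k] = (mux, muy)
            variances[k] = max(var, 1e-9)
            weights[k] = nk / n
    return weights, means, variances, resp

# Synthetic stand-in for the plot: uniform "red" background plus a tight "green" blob.
random.seed(0)
background = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(300)]
cluster = [(random.gauss(1.0, 0.3), random.gauss(0.5, 0.3)) for _ in range(150)]
data = background + cluster

weights, means, variances, resp = fit_two_gaussians(data)
tight = 0 if variances[0] < variances[1] else 1   # the concentrated component
in_cluster = [r[tight] > 0.5 for r in resp]       # hard assignment per point
print("tight component mean:", means[tight], "variance:", variances[tight])
print("points assigned to the tight component:", sum(in_cluster))
```

The component with the smaller variance plays the role of the "green" cluster, and the broad one absorbs the scattered points. Note that a broad Gaussian is only an approximation to uniformly scattered noise.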
Youloush
0

My immediate question would be how you happen to know the red vs. green distinction. If this is known a priori, then what is the purpose of clustering? Clustering is usually used as an unsupervised learning method (group membership unknown), in which the unknown groups are identified by the algorithm.

Having said that, a viable approach would be to apply a transformation before running the clustering algorithm. Such a transformation is often described as applying a kernel. Conceptually, in the transformed coordinates, the groups will be "separated better".

For example, see https://stats.stackexchange.com/a/133694/128491. In particular, note the "spherical" clusters and how transforming to polar coordinates helps the clustering.
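A minimal sketch of the polar-coordinate idea, in plain Python. The data here are synthetic (an inner blob surrounded by a ring), and the helper names `to_polar` and `two_means_1d` are made up for this example. It assumes the structure is roughly radial around the data centroid, which may or may not hold for your plot.

```python
import math
import random

def to_polar(points, center):
    """Convert (x, y) points to (radius, angle) relative to a centre."""
    cx, cy = center
    return [(math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
            for x, y in points]

def two_means_1d(values, iters=50):
    """Plain 1-D k-means with k=2; returns a 0/1 label per value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer centre, then recompute the centres.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in range(2):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

def ring(n, r_lo, r_hi):
    """Sample n points with radius in [r_lo, r_hi] and a uniform angle."""
    pts = []
    for _ in range(n):
        r = random.uniform(r_lo, r_hi)
        t = random.uniform(0, 2 * math.pi)
        pts.append((r * math.cos(t), r * math.sin(t)))
    return pts

random.seed(0)
data = ring(200, 0.0, 1.0) + ring(200, 4.0, 5.0)  # inner blob, then outer ring

# Assumption: the centroid is a reasonable centre for the polar transform.
center = (sum(x for x, _ in data) / len(data),
          sum(y for _, y in data) / len(data))
radii = [r for r, _ in to_polar(data, center)]
labels = two_means_1d(radii)  # cluster on the radius only
print("inner group size:", labels.count(0), "outer group size:", labels.count(1))
```

In the original (x, y) coordinates the ring surrounds the blob, so centroid-based methods struggle. After the transform, the radius alone separates the two groups.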

Just_to_Answer
0

The setting matches the concept of DBSCAN. But there may be too much noise to make it easy to find suitable parameters. Try a rather large minPts of 10 or 20 first.
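For reference, a small plain-Python sketch of DBSCAN on synthetic data that imitates the plot. The data, the function name `dbscan`, and `eps=0.3` are assumptions for this example only. `min_pts=10` follows the suggestion above, and `eps` would need tuning on real data.

```python
import random
from collections import Counter

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN. Returns one label per point; -1 means noise."""
    n = len(points)
    eps2 = eps * eps

    def neighbors(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2 <= eps2]

    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbj)
    return labels

# Synthetic stand-in: sparse uniform background plus one dense blob.
random.seed(0)
background = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(300)]
blob = [(random.gauss(1.0, 0.3), random.gauss(0.5, 0.3)) for _ in range(150)]
data = background + blob

labels = dbscan(data, eps=0.3, min_pts=10)
largest = Counter(lab for lab in labels if lab != -1).most_common(1)[0][0]
print("clusters found:", max(labels) + 1)
print("size of largest cluster:", labels.count(largest))
print("points marked as noise:", labels.count(-1))
```

With a large `min_pts`, the sparse background rarely has enough neighbours within `eps` to form core points, so it ends up as noise while the dense region becomes a cluster.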

There is no kNN clustering. Do you have labels? How was the green group obtained?

Has QUIT--Anony-Mousse