Persistent Cluster IDs for DBSCAN

Question

When running DBSCAN repeatedly on similar (but not identical) data, I would like to generate persistent IDs so that we can monitor how the clusters change over time.

Switching to another algorithm is not possible. This question also applies to other clustering methods whose labels are arbitrary from run to run, such as k-means.

The goal is to minimize the percentage of cluster ID changes (reassignments) for the same observations over time as new data is added.

That is, clustering the same data twice should yield exactly the same ID numbers, with each ID containing the same points. If new clusters appear, new IDs are generated, and so on.

EDIT

The goal is also to be able to reference the information contained in the clusters via third-party applications (not just for analysis).

For example, take a clustering of StackExchange questions: when you read one question, you might like to read other questions of the same type (cluster). Cluster IDs have been assigned to these questions, but for practical reasons one would like to use the same cluster IDs to reference the same group of documents over time (across multiple reclusterings).

Example

{A, B}, {C, D} are assigned cluster IDs 1, 2

In a future iteration, the clusters change to:

{A}, {B,C,D}.

The cluster IDs remain 1 and 2, going by the most common previous cluster ID among the items of each new cluster: for B, C, D the previous IDs were (1, 2, 2), so we choose 2.

If instead, the clusters become:

{A, B}, {C}, {D}, the cluster IDs may be assigned as 1, 2, 3 or as 1, 3, 2, because both assignments minimize the number of cluster ID changes for the items in the clusters.

If instead, the clusters become:

{A}, {B}, {C}, {D}, the cluster IDs may be assigned so that the maximum number of items keep their IDs from the old clusters: cluster ID 1 is assigned to A or B, cluster ID 2 is assigned to C or D, and the two remaining clusters are assigned the new IDs 3 and 4.
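The assignment rule in the examples above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation. The names `relabel`, `old_labels`, `new_clusters` and `next_id` are made up for illustration. It assumes each run's clusters are available as sets of item keys, with any noise points already removed. It uses a greedy largest-overlap-first heuristic, which reproduces the examples here but is not guaranteed to minimize ID changes in every case.

```python
from collections import Counter
from itertools import count


def relabel(old_labels, new_clusters, next_id=None):
    """Carry persistent IDs over to a new clustering (greedy heuristic).

    old_labels:   dict mapping item -> persistent ID from the previous run
    new_clusters: list of sets of items from the current run
    next_id:      first ID to use for brand-new clusters (hypothetical
                  parameter; defaults to one more than the largest old ID)
    """
    # Collect (overlap, old ID, new cluster index) for every overlapping pair.
    candidates = []
    for idx, cluster in enumerate(new_clusters):
        votes = Counter(old_labels[x] for x in cluster if x in old_labels)
        for old_id, overlap in votes.items():
            candidates.append((overlap, old_id, idx))

    # Largest overlap first; ties go to the lowest old ID, then cluster order.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    assigned = {}      # new cluster index -> persistent ID
    used_ids = set()   # old IDs that have already been handed out
    for _overlap, old_id, idx in candidates:
        if idx not in assigned and old_id not in used_ids:
            assigned[idx] = old_id
            used_ids.add(old_id)

    # Clusters that could not inherit an old ID receive fresh IDs.
    if next_id is None:
        next_id = max(old_labels.values(), default=0) + 1
    fresh = count(next_id)

    result = {}
    for idx, cluster in enumerate(new_clusters):
        if idx not in assigned:
            assigned[idx] = next(fresh)
        for item in cluster:
            result[item] = assigned[idx]
    return result


old = {"A": 1, "B": 1, "C": 2, "D": 2}
print(sorted(relabel(old, [{"A"}, {"B", "C", "D"}]).items()))
# [('A', 1), ('B', 2), ('C', 2), ('D', 2)]
print(sorted(relabel(old, [{"A", "B"}, {"C"}, {"D"}]).items()))
# [('A', 1), ('B', 1), ('C', 2), ('D', 3)]
print(sorted(relabel(old, [{"A"}, {"B"}, {"C"}, {"D"}]).items()))
# [('A', 1), ('B', 3), ('C', 2), ('D', 4)]
```

In a long-running system the counter behind `next_id` would probably need to be stored alongside the labels, so that an ID retired in one run is not handed out again later.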

Don't think of the cluster IDs as identifiers. They are only a means of storage for the *sets of points*. It's just common to use integers for this, but it could be anything. Some well-known implementations, like the reference implementation in ELKI, will not assign integer cluster IDs at all. You'll have to solve your problem differently, not relying on the volatile "ID". – Has QUIT--Anony-Mousse, Oct 17 '17 at 05:37
Thank you @Anony-Mousse. The cluster IDs will be used by third-party applications to reference the contents of each cluster, and hence will need to be persistent over time. – John Zhu, Oct 17 '17 at 17:46
Don't rely on DBSCAN to provide such an ID. It will happily reuse them, for example when clusters merge. It's up to you to provide such an application-specific ID. Find the ones with the largest overlap, etc. and assign your own IDs. – Has QUIT--Anony-Mousse, Oct 17 '17 at 21:11

score 1 · Answer 1 · answered Oct 17 '17 at 06:31

"so we can monitor how the clusters changed over time"

This is your key question: you want to compare clusters. For that, you don't need a "persistent" ID.

If one clustering clustered $\{A,B\}, \{C\}$ and the other one clustered $\{B\}, \{A,C\}$, what would a "persistent" ID for the two clusters be? What if a third clustering was $\{A\}, \{B\}, \{C\}$ and a fourth one $\{A, B, C\}$?

Instead, solve your original problem by comparing clusterings directly. The following earlier thread would be a good start: Understanding comparisons of clustering results.
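As one possible illustration of comparing clusterings directly, here is a minimal sketch of the plain Rand index: the fraction of item pairs that both clusterings treat the same way (together in both, or apart in both). The choice of this particular measure is an assumption, since no specific one is named above, and the function name `rand_index` and the dict-based input format are made up for the example.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    labels_a, labels_b: dicts mapping item -> cluster label.
    Only items present in both clusterings are compared, and the
    label values themselves never need to match between the two.
    """
    common = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0  # nothing to disagree about
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


first = {"A": 0, "B": 0, "C": 1}    # {A, B}, {C}
second = {"A": 1, "B": 0, "C": 1}   # {B}, {A, C}
print(rand_index(first, second))    # agrees on 1 of 3 pairs (B-C only)
```

Because only co-membership is compared, the result does not depend on how the clusters are numbered in either run. scikit-learn provides a chance-corrected variant as `sklearn.metrics.adjusted_rand_score`, which may be preferable when cluster counts differ a lot between runs.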

Stephan Kolassa


Thanks @stephan. We need a persistent ID because the majority of the clusters will remain the same across multiple clustering iterations. The clusters are used by many other applications, so we need to be able to reference them, and for that we need an ID. The goal of the cluster ID assignment would be to minimize adding and changing cluster ID assignments. The example of {A,B}, {C} (initially clusters 1 and 2) changing to {B}, {A,C} could be assigned either IDs 1, 2 or 2, 1; either case remains the same. The third example would be 1, 3, 2, and the fourth can be assigned either 1 or 2. – John Zhu Oct 17 '17 at 17:37
