When running DBSCAN repeatedly on similar (but not identical) data, I would like to generate persistent IDs so that we can monitor how the clusters change over time.
Switching to another algorithm is not an option. This question also applies to other methods whose cluster labels can change between runs, such as k-means.
The goal is to minimize the percentage of cluster ID changes (reassignments) for the same observations over time as new data is added.
That is, clustering the same data twice should yield exactly the same IDs, with each ID containing the same points. If new clusters appear, new IDs are generated, and so on.
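To make the objective concrete, here is a minimal sketch of the quantity I would like to minimize. The function name `id_change_rate` and the dict-based representation (observation to cluster ID) are only assumptions for illustration, not part of any library.

```python
def id_change_rate(old_ids, new_ids):
    """Fraction of observations present in both runs whose cluster ID changed."""
    # Only observations that exist in both runs can "change" their ID.
    shared = old_ids.keys() & new_ids.keys()
    if not shared:
        return 0.0
    changed = sum(1 for obs in shared if old_ids[obs] != new_ids[obs])
    return changed / len(shared)


# Hypothetical example: B moves from cluster 1 to cluster 2, E is new.
old = {"A": 1, "B": 1, "C": 2, "D": 2}
new = {"A": 1, "B": 2, "C": 2, "D": 2, "E": 3}

print(id_change_rate(old, new))  # 0.25
print(id_change_rate(old, old))  # 0.0
```

Newly added observations (such as E here) are ignored, since they have no previous ID to compare against.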
EDIT
The goal is also to be able to reference the information contained in the clusters from third-party applications (not just for analysis).
For example, consider clustering StackExchange questions: when you read one question, you might like to read other questions of the same type (cluster). Cluster IDs have been assigned to these questions, but for practical reasons one would like to use the same cluster IDs to reference the same group of documents over time (across multiple reclusterings).
Example
The clusters {A, B}, {C, D} are assigned cluster IDs 1, 2.
In a future iteration, the clusters change to {A}, {B, C, D}. The cluster IDs remain 1, 2, going by the most common previous cluster ID among each cluster's items: for B, C, D the previous IDs were (1, 2, 2), so we choose 2.
If instead the clusters become {A, B}, {C}, {D}, cluster IDs may be assigned as 1, 2, 3 or 1, 3, 2, because both assignments minimize the number of cluster ID changes for the items in the clusters.
If instead the clusters become {A}, {B}, {C}, {D}, cluster IDs may be assigned so that the maximum number of items keep their IDs from the old clusters: cluster ID 1 is assigned to A or B, cluster ID 2 is assigned to C or D, and the two remaining new clusters are assigned IDs 3, 4.
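The behaviour described in these examples can be sketched as a greedy maximum-overlap matching between the new clusters and the previous IDs. This is only an illustration under stated assumptions: the function `relabel` and its inputs are hypothetical, greedy matching is not guaranteed to minimize the number of ID changes in general (an optimal assignment could instead be computed with something like `scipy.optimize.linear_sum_assignment`), and fresh IDs are taken above the largest ID in the previous run, whereas a real system would probably track the largest ID ever issued.

```python
from collections import Counter
from itertools import count


def relabel(prev_ids, new_clusters):
    """Assign persistent IDs to new_clusters by greedy maximum overlap.

    prev_ids: dict mapping item -> cluster ID from the previous run.
    new_clusters: list of sets of items from the current run.
    Returns a list of IDs, one per new cluster, in the same order.
    """
    # Count how many items each (new cluster, old ID) pair shares.
    overlaps = []
    for idx, members in enumerate(new_clusters):
        votes = Counter(prev_ids[m] for m in members if m in prev_ids)
        for old_id, shared in votes.items():
            overlaps.append((shared, idx, old_id))
    # Largest overlap first; ties broken by cluster position, then old ID.
    overlaps.sort(key=lambda t: (-t[0], t[1], t[2]))

    assigned = {}  # new cluster index -> reused old ID
    used = set()   # old IDs already claimed by some new cluster
    for shared, idx, old_id in overlaps:
        if idx not in assigned and old_id not in used:
            assigned[idx] = old_id
            used.add(old_id)

    # Unmatched clusters get fresh IDs above every ID in the previous run.
    fresh = count(max(prev_ids.values(), default=0) + 1)
    return [assigned[i] if i in assigned else next(fresh)
            for i in range(len(new_clusters))]


prev = {"A": 1, "B": 1, "C": 2, "D": 2}
print(relabel(prev, [{"A"}, {"B", "C", "D"}]))       # [1, 2]
print(relabel(prev, [{"A", "B"}, {"C"}, {"D"}]))     # [1, 2, 3]
print(relabel(prev, [{"A"}, {"B"}, {"C"}, {"D"}]))   # [1, 3, 2, 4]
```

Ties are broken deterministically by the position of the cluster in the list, so the second case always yields 1, 2, 3 rather than 1, 3, 2; either answer would be acceptable for my purposes.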