I have a big dataset of M
sequences of [1 - N]
events, where each event has multiple properties (start date, end date, location, and more contextual features).
For each sequence of [1-N]
events I want to find up to K (1<=K<=N)
subsequences (clusters) based on events similarity.
For example, the sequence of the following events: [A B C D]
EVENT START_DATE END_DATE LON LAT
A 2018-01-01 2018-01-02 10 15
B 2018-01-02 2018-01-03 10 15
C 2018-02-01 2018-01-08 20 30
D 2018-03-01 2018-03-03 10 15
can be split into:
[[A] [B] [C] [D]]
,[[A B] [C] [D] ]
, [[A][B C][D]]
, [[A] [B] [C D]]
, [[A B] [C D]]
, [[A B C][D]]
,[[A] [B C D]]
, [[A B C D]]
Where the obvious split is [[A B][C][D]]
since A
and B
are consecutive in same location with similar time-span, C
happens a month after in a different location and D
happens in the same location but 2 month later.
Some assumptions/special treatment might apply:
- Same clusters appear sequentially in terms of time (
[[A C] [B] [D]]
is not possible) - Different features might have different weights (time proximity might be more important that date proximity
- Number of clusters is unknown
- Ground truth is unknown, but can be determined in a quite accurate way by human labelling.
- The calculation needs to be realtime (<1s for M<100), but can be calculated incrementally (recalculated after addition of a new event)
Any suggestions of what algorithm can be useful for this problem? Classical clustering here is not suitable and I'm looking for other options.