Unsupervised clustering of sequence of events to subsequences

Question

I have a big dataset of M sequences of [1 - N] events, where each event has multiple properties (start date, end date, location, and more contextual features).

For each sequence of [1-N] events I want to find up to K (1<=K<=N) subsequences (clusters) based on events similarity.

For example, the sequence of the following events: [A B C D]

EVENT     START_DATE     END_DATE       LON      LAT

A         2018-01-01     2018-01-02     10       15
B         2018-01-02     2018-01-03     10       15
C         2018-02-01     2018-01-08     20       30
D         2018-03-01     2018-03-03     10       15

can be split into:

[[A] [B] [C] [D]],[[A B] [C] [D] ], [[A][B C][D]], [[A] [B] [C D]], [[A B] [C D]], [[A B C][D]],[[A] [B C D]], [[A B C D]]

Where the obvious split is [[A B][C][D]] since A and B are consecutive in same location with similar time-span, C happens a month after in a different location and D happens in the same location but 2 month later.

Some assumptions/special treatment might apply:

Same clusters appear sequentially in terms of time ([[A C] [B] [D]] is not possible)
Different features might have different weights (time proximity might be more important that date proximity
Number of clusters is unknown
Ground truth is unknown, but can be determined in a quite accurate way by human labelling.
The calculation needs to be realtime (<1s for M<100), but can be calculated incrementally (recalculated after addition of a new event)

Any suggestions of what algorithm can be useful for this problem? Classical clustering here is not suitable and I'm looking for other options.

Has QUIT--Anony-Mousse · Answer 1 · 2019-04-01T00:58:32.717

1

Try generalized DBSCAN.

Define two thesis:

Distance, points sold be "near" each other
Time, events should be close in time

Don't use a standard implementation. Instead exploit that you can sort the data by time, and only need to consider the events within the time threshold.

edited Apr 01 '19 at 00:58

answered Mar 26 '19 at 07:11

Has QUIT--Anony-Mousse

39,639
7
61
96

I'm sorry but I can't see how it relates to *frequent subsequence*. I'm not looking for the popular subsequences (they probably won't repeat at all), but looking how to group events in the sequence into clusters. (reference for FS - http://ai.ms.mff.cuni.cz/~sui/sequential_mining.pdf) – Dimgold Mar 27 '19 at 08:11
1

Then you need to explain your problem better, what kind of patterns you are looking for. – Has QUIT--Anony-Mousse Mar 27 '19 at 16:00
2

So give examples of the actual input and output, not A B C if these aren't items in a sequence, and how a cluster is supposed to be defined. – Has QUIT--Anony-Mousse Mar 28 '19 at 18:24

Unsupervised clustering of sequence of events to subsequences

1 Answers1