4

I have a big dataset of M sequences of [1 - N] events, where each event has multiple properties (start date, end date, location, and more contextual features).

For each sequence of [1-N] events I want to find up to K (1<=K<=N) subsequences (clusters) based on events similarity.

For example, the sequence of the following events: [A B C D]

EVENT     START_DATE     END_DATE       LON      LAT

A         2018-01-01     2018-01-02     10       15
B         2018-01-02     2018-01-03     10       15
C         2018-02-01     2018-01-08     20       30
D         2018-03-01     2018-03-03     10       15

can be split into:

[[A] [B] [C] [D]],[[A B] [C] [D] ], [[A][B C][D]], [[A] [B] [C D]], [[A B] [C D]], [[A B C][D]],[[A] [B C D]], [[A B C D]]

Where the obvious split is [[A B][C][D]] since A and B are consecutive in same location with similar time-span, C happens a month after in a different location and D happens in the same location but 2 month later.

Some assumptions/special treatment might apply:

  • Same clusters appear sequentially in terms of time ([[A C] [B] [D]] is not possible)
  • Different features might have different weights (time proximity might be more important that date proximity
  • Number of clusters is unknown
  • Ground truth is unknown, but can be determined in a quite accurate way by human labelling.
  • The calculation needs to be realtime (<1s for M<100), but can be calculated incrementally (recalculated after addition of a new event)

Any suggestions of what algorithm can be useful for this problem? Classical clustering here is not suitable and I'm looking for other options.

Dimgold
  • 318
  • 3
  • 15

1 Answers1

1

Try generalized DBSCAN.

Define two thesis:

  1. Distance, points sold be "near" each other
  2. Time, events should be close in time

Don't use a standard implementation. Instead exploit that you can sort the data by time, and only need to consider the events within the time threshold.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
  • I'm sorry but I can't see how it relates to *frequent subsequence*. I'm not looking for the popular subsequences (they probably won't repeat at all), but looking how to group events in the sequence into clusters. (reference for FS - http://ai.ms.mff.cuni.cz/~sui/sequential_mining.pdf) – Dimgold Mar 27 '19 at 08:11
  • 1
    Then you need to explain your problem better, what kind of patterns you are looking for. – Has QUIT--Anony-Mousse Mar 27 '19 at 16:00
  • 2
    So give examples of the actual input and output, not A B C if these aren't items in a sequence, and how a cluster is supposed to be defined. – Has QUIT--Anony-Mousse Mar 28 '19 at 18:24