Identifying sequential patterns

Question

I am working with sequence data consisting of long lists of Windows API calls made by malware. I am trying to cast the problem of identifying 'malware behavior' as one of finding sequential patterns. I treat each API call as a single-item itemset. The number of distinct items (API calls) is quite large.

When I apply the SPADE algorithm (see Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, 42, 31–60, 2001), I run into memory problems. Is there a better way to find sequential patterns in long sequences with a large vocabulary?

Could you use a Markov chain Monte Carlo approach? – Zach, Aug 22 '11 at 16:47

score 2 · Answer 1 · answered Aug 23 '11 at 04:22

You can map the data into a feature space that preserves order information, combining statistics calculated over sliding windows with cumulative statistics, and use those features in a decision tree.

A decision tree could handle both sequences and non-sequential data. This may substantially reduce your data complexity.
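As a rough illustration of this kind of feature mapping, here is a minimal Python sketch. It is only one possible reading of the suggestion, not a reference implementation: the function name `window_features`, the window size, the use of n-grams as the order-sensitive feature, and the toy call trace are all assumptions made for the example.

```python
from collections import Counter

def window_features(calls, window=4, n=2):
    """Return one feature dict per sliding window over a call sequence.

    Hypothetical helper: n-gram counts capture order inside the window,
    and "cum:" counts are running totals of every call seen so far.
    """
    rows = []
    # Calls before the last position of the first window are counted up front.
    cumulative = Counter(calls[:window - 1])
    for start in range(len(calls) - window + 1):
        win = calls[start:start + window]
        cumulative[win[-1]] += 1  # the call that just entered the window
        row = {}
        # Order-sensitive features: n-grams inside the current window.
        for i in range(window - n + 1):
            key = "ngram:" + ">".join(win[i:i + n])
            row[key] = row.get(key, 0) + 1
        # Cumulative statistics up to the end of the current window.
        for call, count in cumulative.items():
            row["cum:" + call] = count
        rows.append(row)
    return rows

# Toy trace, invented for illustration only.
trace = ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle", "CreateFileW"]
rows = window_features(trace, window=3, n=2)
```

Each dict in `rows` is a sparse feature vector. One way to feed them to a tree would be to vectorize them (for example with scikit-learn's `DictVectorizer`) and train a `DecisionTreeClassifier` on labeled traces, though that step is not shown here.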

score 1 · Answer 2 · answered Apr 01 '14 at 18:17

You may try other sequential pattern mining algorithms.

For example, the open-source SPMF Java data mining library offers SPADE, as well as PrefixSpan, SPAM, CM-SPAM, CM-SPADE, GSP, etc. (by the way, I'm the project founder). To my knowledge, CM-SPADE is usually faster than SPADE. In terms of memory, SPAM may use less, so it could be worth trying.
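To give a feel for how one of the listed alternatives works, here is a small Python sketch of the PrefixSpan idea for sequences of single items, such as one API call per step. It is a toy written for illustration, not SPMF's implementation; the function name `prefixspan`, the `max_len` cap, and the sample data are assumptions. It stores only (sequence index, position) pairs for each projected database instead of copying suffixes, which is the pseudo-projection trick that helps keep memory use down.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, max_len=5):
    """Return {pattern tuple: support} for frequent subsequences.

    Simplified sketch: every element of a sequence is a single item,
    and patterns longer than max_len (an assumed cap) are not explored.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, position just after the prefix match)
        if len(prefix) >= max_len:
            return
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Keep only the first occurrence per sequence, so the
                    # list length equals the number of supporting sequences.
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, next_projected in occurrences.items():
            if len(next_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(next_projected)
                mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results

# Toy data, invented for illustration only.
toy = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan(toy, min_support=2)
```

On real traces with a large vocabulary, raising `min_support` or lowering `max_len` would be the obvious knobs for limiting the number of patterns kept in memory.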
