8

I am working with sequence data: long lists of Win-API calls made by malware. I am trying to cast the problem of identifying 'malware behavior' as one of finding sequential patterns. I treat each API call as a single-item itemset. The number of distinct items (API calls) is quite large.

Now, when I apply the SPADE algorithm (see Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, 42, 31–60, 2001), I run into memory problems. Is there a better way to find sequential patterns in long sequences with a large vocabulary?

kjetil b halvorsen
chet

2 Answers

2

You can map the data into a feature space where sequence is important, combined with statistics calculated over sliding windows and cumulative statistics, and use those features in a decision tree.

A decision tree can handle both sequence-derived and non-sequential features. This may substantially reduce the complexity of your data.
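As a rough illustration of that feature mapping, here is a minimal Python sketch. The function name `window_features`, the window size, the feature names, and the example API calls are all assumptions made for illustration; they are not from any particular library. Each position in the call list gets counts over a trailing window, cumulative counts so far, and the previous call as a simple order-aware feature.

```python
from collections import Counter

def window_features(calls, vocab, window=3):
    """Build one feature row per position in a list of API calls.

    Hypothetical helper: combines trailing-window counts, cumulative
    counts, and the previous call (a bigram-like, order-aware feature).
    """
    rows = []
    cumulative = Counter()
    for i, call in enumerate(calls):
        cumulative[call] += 1
        # Counts over the last `window` calls, including the current one.
        recent = Counter(calls[max(0, i - window + 1): i + 1])
        row = {}
        for name in vocab:
            row[f"win_{name}"] = recent[name]
            row[f"cum_{name}"] = cumulative[name]
        row["prev_call"] = calls[i - 1] if i > 0 else None
        rows.append(row)
    return rows

# Made-up example trace, not real malware data.
calls = ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory",
         "CreateRemoteThread", "OpenProcess"]
vocab = sorted(set(calls))
rows = window_features(calls, vocab, window=3)
```

The resulting rows (after encoding `prev_call` numerically) could then be passed to a decision tree learner such as scikit-learn's `DecisionTreeClassifier`. With a very large vocabulary you would probably restrict `vocab` to the most frequent calls first.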

Iterator
1

You could try another sequential pattern mining algorithm.

For example, the open-source SPMF Java data mining library offers SPADE, but also PrefixSpan, SPAM, CM-SPAM, CM-SPADE, GSP, etc. (by the way, I'm the project founder). To my knowledge, CM-SPADE is usually faster than SPADE. In terms of memory, SPAM may use less. You could try it.
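To give a feel for one of the alternatives listed above, here is a minimal pure-Python sketch of PrefixSpan for sequences of single items. It is not SPMF's implementation; the function name `prefixspan` and the example sequences are assumptions for illustration. It uses pseudo-projection, meaning each projected database is only a list of (sequence index, start position) pairs rather than copies of the suffixes.

```python
from collections import defaultdict

def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for sequences of single items.

    Support is the number of sequences that contain the pattern as a
    (not necessarily contiguous) subsequence.
    """
    results = {}

    def mine(prefix, projected):
        # projected: (sequence index, position where the suffix starts)
        occurrences = defaultdict(list)
        for sid, start in projected:
            seq = sequences[sid]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Keep only the first occurrence per sequence.
                    seen.add(item)
                    occurrences[item].append((sid, pos + 1))
        for item, new_projected in occurrences.items():
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                mine(pattern, new_projected)

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results

# Made-up example traces, not real malware data.
sequences = [
    ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
    ["OpenProcess", "ReadFile", "WriteProcessMemory", "CreateRemoteThread"],
    ["ReadFile", "CloseHandle"],
]
patterns = prefixspan(sequences, min_support=2)
```

This sketch recurses once per pattern item, so very long frequent patterns could hit Python's recursion limit. Whether it actually uses less memory than SPADE on your data depends on the support threshold and sequence lengths, so treat it as a starting point for experimentation.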

Phil