One way to do it (among many others) is to treat each element of your sequence as a word. In other words, if you treat each list as a sentence, you can extract n-grams from it.
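For instance, nltk's ngrams can turn one of the raw sequences below into overlapping tuples of consecutive elements (shown here with bigrams, n=2):

```python
from nltk import ngrams

a = [1, 15, 1, 1, 13, 14]
# each number plays the role of a "word"; bigrams pair consecutive elements
bigrams = list(ngrams(a, 2))
print(bigrams)  # [(1, 15), (15, 1), (1, 1), (1, 13), (13, 14)]
```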
# nltk's ngrams is only needed if you want to extract n-grams yourself;
# the CountVectorizer below handles the plain word counts on its own
from nltk import ngrams
a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]
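For reference, each entry of bb is now a comma-separated "sentence" of prefixed tokens, e.g. for the first list:

```python
a = [1, 15, 1, 1, 13, 14]
# prefix every number with 'x' and join into one comma-separated string
sentence = ','.join('x' + str(e) for e in a)
print(sentence)  # x1,x15,x1,x1,x13,x14
```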
I added the x prefix because CountVectorizer's default tokenizer drops single-character tokens. Let's do a word count; alternatively, you can go ahead with n-grams (see the scikit-learn documentation) as well.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
X.toarray()
The output looks like this:
array([[3, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[5, 0, 4, 1, 0, 1, 1, 1, 0, 1, 0, 6, 2],
[1, 0, 0, 0, 3, 0, 1, 9, 0, 0, 0, 0, 2],
[3, 3, 0, 0, 0, 1, 0, 1, 1, 0, 3, 0, 1]])
The columns correspond to the words, which are
print(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
['x1', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x2', 'x3', 'x4', 'x6', 'x7', 'x9']
and the rows correspond to your samples.
Now that you have a feature matrix, you can go ahead and do clustering, for example with k-means:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
which results in
array([0, 1, 0, 0], dtype=int32)
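If you later want to assign a new, unseen sequence to one of the learned clusters, vectorize it with the same fitted vectorizer (transform, not fit_transform, so the columns stay aligned) and call predict. A sketch; new_seq is a made-up sample, and n_init is pinned only because its default varies across scikit-learn versions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# a hypothetical new sequence, turned into the same "sentence" format
new_seq = [1, 1, 13, 14]
new_row = vectorizer.transform([','.join('x' + str(e) for e in new_seq)])
label = kmeans.predict(new_row)
```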