One way to do it (among many others) is to treat each element of your sequence as a word. In other words, if you treat each list as a sentence, you can extract n-grams from it.
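For instance, nltk's ngrams can turn one of the raw sequences below into overlapping tuples of consecutive elements (shown here with bigrams, n=2):

```python
from nltk import ngrams

a = [1, 15, 1, 1, 13, 14]
# each number plays the role of a "word"; bigrams pair consecutive elements
bigrams = list(ngrams(a, 2))
print(bigrams)  # [(1, 15), (15, 1), (1, 1), (1, 13), (13, 14)]
```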
# nltk's ngrams is only needed if you want to extract n-grams yourself;
# the CountVectorizer below handles the plain word counts on its own
from nltk import ngrams
a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]
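For reference, each entry of bb is now a comma-separated "sentence" of prefixed tokens, e.g. for the first list:

```python
a = [1, 15, 1, 1, 13, 14]
# prefix every number with 'x' and join into one comma-separated string
sentence = ','.join('x' + str(e) for e in a)
print(sentence)  # x1,x15,x1,x1,x13,x14
```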
I added the x prefix because CountVectorizer's default tokenizer drops single-character tokens. Let's do a word count; alternatively, you can go ahead with n-grams (see the scikit-learn documentation) as well.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
X.toarray()
The output looks like this:
array([[3, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[5, 0, 4, 1, 0, 1, 1, 1, 0, 1, 0, 6, 2],
[1, 0, 0, 0, 3, 0, 1, 9, 0, 0, 0, 0, 2],
[3, 3, 0, 0, 0, 1, 0, 1, 1, 0, 3, 0, 1]])
The columns correspond to the words, which are
print(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
['x1', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x2', 'x3', 'x4', 'x6', 'x7', 'x9']
and the rows correspond to your samples.
Now that you have a feature matrix, you can go ahead and do clustering, for example with k-means:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
which results in
array([0, 1, 0, 0], dtype=int32)
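If you later want to assign a new, unseen sequence to one of the learned clusters, vectorize it with the same fitted vectorizer (transform, not fit_transform, so the columns stay aligned) and call predict. A sketch; new_seq is a made-up sample, and n_init is pinned only because its default varies across scikit-learn versions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# a hypothetical new sequence, turned into the same "sentence" format
new_seq = [1, 1, 13, 14]
new_row = vectorizer.transform([','.join('x' + str(e) for e in new_seq)])
label = kmeans.predict(new_row)
```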