Cluster clickstream data

Question

I've recently entered the realm of machine learning and a project I am working on requires me to cluster users based on the order they visited webpages on a website. I have data in the form of:

['user_id', 1, 2, 4, 6, 3, 7, 3, 2, 4...]

Where each number is a category/page that the user visited. In addition, the length of the data is not the same for each user, i.e. some users visit more pages than others.

I realize this is really vague and defining similarity is hard. I tried following the example in this research paper and, to be honest, a lot of it went over my head.

I need help in how to approach this problem and am open to new ideas and suggestions.

score 3 · Answer 1 · edited Jun 11 '13 at 18:23

It is a good question with many practical applications.

Your data are sequential, so we need a similarity measure between any pair of sequences. I recommend the Levenshtein distance since it is very intuitive and very nicely defined. See also this nice bachelor thesis with an overview of more measures for sequential data.
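To make the distance concrete, here is a minimal sketch of the Levenshtein (edit) distance applied to lists of page ids rather than characters. The function name `levenshtein` and the example sessions are made up for illustration; each insertion, deletion, or substitution of a page counts as 1.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (lists of page ids or strings)."""
    # Keep the shorter sequence in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 items of a and the first j items of b
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # deleting all i items of a to reach an empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x from a
                current[j - 1] + 1,      # insert y into a
                previous[j - 1] + cost,  # substitute x by y (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical sessions: the second user skipped page 4.
session_a = [1, 2, 4, 6, 3]
session_b = [1, 2, 6, 3]
print(levenshtein(session_a, session_b))  # 1
```

Since sessions have different lengths, one option is to divide the result by the length of the longer session so that long sessions do not dominate. Whether that is appropriate depends on the data.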

Finally, once we have the distances between all pairs of sequences, we can use any clustering algorithm that takes a distance matrix as input (for example, any hierarchical algorithm).
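As a rough illustration of clustering from a distance matrix, the sketch below does a naive agglomerative (hierarchical) clustering with average linkage in plain Python. The function name `average_linkage_clusters`, the choice of average linkage, and the small distance matrix are assumptions made for the example. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the more usual choice.

```python
def average_linkage_clusters(dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of row indices into dist.
    """
    clusters = [[i] for i in range(len(dist))]  # start with one cluster per item

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge q into p
        del clusters[q]
    return clusters


# Hypothetical pairwise distances between four users' sessions.
dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [5, 4, 1, 0],
]
print(average_linkage_clusters(dist, 2))  # [[0, 1], [2, 3]]
```

This brute-force version rescans all cluster pairs at every merge, so it is only meant for small data sets.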

score 2 · Answer 2 · answered Jul 27 '16 at 12:27

You can use the clickstream or ClickClust package in R. Either one does exactly what you are looking for.

Sagar

This is really better suited as a comment than an answer. – Silverfish Jul 27 '16 at 13:08
