I'm a Statistics student, and I'm thinking of writing my master's thesis on clickstream data analysis.
For my analysis I have a fairly large dataset (80 million rows), where each row is a single click "impression". The dataset comes from a news website and includes information such as the following (a hypothetical mock-up of a few rows is sketched after the list):
- User ID when the user is logged in to the website
- User ID when the user is NOT logged in (essentially a cookie ID)
- Date and time of the visit
- URL of the visited page
- "Section" of the page (for example Sport, News, ...; there are many categories)
- Number of clicks that led the user to land on that page
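Just to make the structure concrete, here is a mock-up of a few rows. The column names and values are entirely invented for illustration; the real dataset uses different ones:

```python
import pandas as pd

# Hypothetical illustration of the row structure (columns and values invented).
impressions = pd.DataFrame({
    "user_id":   [123, 123, None, 456],            # ID when logged in
    "cookie_id": ["abc", "abc", "xyz", "def"],     # ID when not logged in
    "timestamp": pd.to_datetime([
        "2023-05-01 08:15", "2023-05-01 08:20",
        "2023-05-01 09:00", "2023-05-02 18:45"]),
    "url":       ["/sport/match-report", "/news/elections",
                  "/sport/transfer-news", "/news/economy"],
    "section":   ["Sport", "News", "Sport", "News"],
    "clicks_to_page": [1, 2, 1, 3],                # clicks that led to the page
})
print(impressions)
```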
What I'd like to do with this data is estimate the probability that a new user would click on a given new article, in order to recommend what they should read next. I have in mind something like a score.
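For example, one way I imagine framing it (just a sketch with made-up placeholder features, not a worked-out method) would be to build one row per (user, article) pair with a 0/1 "clicked" label, fit a logistic regression, and use the predicted probability as the score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch only: X would hold engineered (user, article) features, e.g. the
# user's share of past clicks in the article's section, recency, and so on.
# Here random numbers stand in for real features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # placeholder features
y = rng.integers(0, 2, size=1000)       # placeholder 0/1 "clicked" labels

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]   # predicted click probability = "score"
```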
From my research I found that common ways to tackle this type of problem are association rules, path analysis, and collaborative filtering.
What I'd like to know is: is it possible to approach the problem with "classic" data mining/machine learning techniques? I'm talking about GLMs, decision trees, neural networks, ... and other similar algorithms for supervised learning.
I ask because each row is a single impression, so what I really have is a click "path" for each user, and I'm not sure whether it would be statistically correct to apply one of the models I mentioned to data structured this way.
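To illustrate what I mean, here is a rough sketch (reusing the hypothetical `impressions` frame from the mock-up above) of how I imagine I would first have to collapse each user's path into a single feature row before any supervised model could be applied:

```python
# Rough sketch: collapse each user's path of impressions into one row of
# per-user features (column names come from the hypothetical mock-up above).
user_features = (
    impressions
    .assign(uid=impressions["user_id"].fillna(impressions["cookie_id"]).astype(str))
    .groupby("uid")
    .agg(
        n_impressions=("url", "size"),
        n_sections=("section", "nunique"),
        avg_clicks_to_page=("clicks_to_page", "mean"),
    )
    .reset_index()
)
```

Whether this kind of aggregation is the right way to deal with the dependence between impressions from the same user is exactly the part I'm unsure about.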