
I have website page load data. Every row represents one visitor, and the attributes are the pages they visited. However, visitors often visit only 2-3 pages, so most values are null. The values are how long the page took to load, in seconds. The Target column is a 0/1 flag indicating conversion. I want to know whether longer page load times lead to lower conversion, and whether particular pages affect the conversion rate.

visitor_cookie  Page A    Page B   Page C   Page D   Page E    Target
10ed4da0e0      .4        NA       .6       NA       NA        0
18f7746         .3        .4       NA       NA       NA        1
1ffdc2f6        0.527     NA       NA       NA       4.05733   0
226b9dc52f      NA        .3       NA       NA       3.077     0
241a6095a8      NA        .7       .4       .8       NA        1

I want to fit some sort of logistic regression/GLM, but I'm not sure how to handle all the NAs. I can't just drop the rows that contain them, because every record has nulls, and I can't just impute 0s, because every column has too many nulls. I thought about reshaping the dataframe from wide to long, so that every record represents one page load time (instead of one customer visit). However, it seems I can't do that, because I would lose an important piece of information: that certain records belong to the same visitor.

How can I account for these varying null values? Thank you.

pythonnoob
  • Rather than think about imputing the page-load times of pages that a visitor didn't visit (which you can safely assume will have no effect at all on their probability of conversion), think about encoding the fact that they didn't visit those pages as a predictor in your model. – Scortchi - Reinstate Monica May 10 '18 at 19:44
  • Thank you for the suggestion. I'm not sure how to do that though. I suppose I could add a column for every page that is 0/1 for whether they visited that page or not. But then I'd still have NAs in the original columns. – pythonnoob May 10 '18 at 20:01
  • Consider what effect setting a page-load time of 'NA' to 0, or to 42, or to any other value will have on your model once such a set of indicator variables is included as predictors. – Scortchi - Reinstate Monica May 10 '18 at 20:42
  • Hmm, so say I set a "PageAIndicator" variable to 0 when "Page A" is NA (or 0 or 42). Would my model then just not take into account the weight of the "Page A" value? But wouldn't it still affect the final result when I don't want it to? Maybe I could account for that by adding them as an interaction effect in the model, like PageAIndicator*PageA? Thank you for helping me think through this. – pythonnoob May 10 '18 at 23:52
  • See e.g. https://stats.stackexchange.com/a/105258/17230. And you're welcome. – Scortchi - Reinstate Monica May 11 '18 at 08:46
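The indicator idea from the comments can be sketched roughly as below, assuming pandas and NumPy and using only the five example rows from the question. The helper names `encode` and `fitted_values` and the " visited" / " time" column suffixes are made up for illustration. The check at the end uses a plain least-squares fit on a single page rather than a logistic model, purely to keep the sketch dependency-free; the point it illustrates is that once the indicator is in the model, filling NA with 0 or with 42 spans the same set of predictors, so the fitted values should not change.

```python
import numpy as np
import pandas as pd

# The five example rows from the question (NaN = page not visited).
df = pd.DataFrame(
    {
        "Page A": [0.4, 0.3, 0.527, np.nan, np.nan],
        "Page B": [np.nan, 0.4, np.nan, 0.3, 0.7],
        "Page C": [0.6, np.nan, np.nan, np.nan, 0.4],
        "Page D": [np.nan, np.nan, np.nan, np.nan, 0.8],
        "Page E": [np.nan, np.nan, 4.05733, 3.077, np.nan],
        "Target": [0, 1, 0, 0, 1],
    }
)
pages = [c for c in df.columns if c.startswith("Page")]


def encode(frame, fill_value):
    """Hypothetical helper: 0/1 visited indicators plus load times with NaN filled."""
    visited = frame[pages].notna().astype(int).add_suffix(" visited")
    times = frame[pages].fillna(fill_value).add_suffix(" time")
    return pd.concat([visited, times], axis=1)


# Ten predictors (five indicators, five filled load times), no NaN left.
features = encode(df, 0.0)


def fitted_values(frame, page, fill_value):
    """Least-squares fit of Target on intercept + indicator + filled time for one page."""
    enc = encode(frame, fill_value)
    X = np.column_stack(
        [np.ones(len(frame)), enc[f"{page} visited"], enc[f"{page} time"]]
    )
    y = frame["Target"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


# Compare the fit when NA is filled with 0 versus 42 (Page A only).
same_fit = np.allclose(
    fitted_values(df, "Page A", 0.0), fitted_values(df, "Page A", 42.0)
)
print(same_fit)
```

With a real dataset, the `features` frame could be passed to whichever logistic regression routine you prefer; only the coefficient on each indicator should absorb the choice of fill value.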

0 Answers