I'm looking for a model to predict CTR (click-through rate)

I have the following data: For each ad I know the number of impressions, clicks and some other attributes (which are mainly dummy variables).

The CTR per ad is calculated as follows: #clicks / #impressions.

I have two questions regarding predicting CTR:

  1. I am wondering which model should be used to predict the CTR. I tried a linear regression, but the R-squared is very low (around 10%-15%). A logistic regression is not an option as my dependent variable is not a 0/1 variable.

  2. When I run a linear regression with clicks as dependent variable and impressions, etc. as explanatory variables, my R-squared suddenly is around 85-95%. How is it possible that this differs so much from taking CTR as dependent variable?

EDIT: I followed the approach from kjetil, which works perfectly.

M09
  • Have you considered a zero-inflated model? Presumably there would be a lot of zeroes from people who don't click at all. – Huy Pham Jan 08 '19 at 05:03
  • @HuyPham I don't have user-level data, I only have aggregated data per ad, so for one ad I for instance know there have been 10000 impressions and 4 clicks. There are no ads with zero clicks in my dataset. Or should I maybe un-group the dataset such that for instance I get 9996 rows with zeros and 4 with ones (clicks)? – M09 Jan 08 '19 at 08:48
  • No, the answer by kjetil works better. I was just throwing an idea out there. – Huy Pham Jan 08 '19 at 22:20

1 Answer

You should try logistic regression. Let $x=\text{number of clicks}$ and $n=\text{number of impressions}$. Then $\text{CTR}=x/n$, and by modeling that proportion directly you lose information: the ratio alone does not say how many impressions it is based on. A logistic regression (possibly quasibinomial) at least gets the variance structure right. In R you could do something like:

mod <- glm( cbind(x,n-x) ~ size + etc..., data=your_data_frame, family=quasibinomial())

A similar post, with an answer and an example, is "Count explanatory variable, proportion dependent variable".
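As a minimal, self-contained sketch of the call above, the following simulates per-ad data and fits the model. The data frame `ads`, the single attribute `size`, and the simulated click probabilities are assumptions made for illustration; they are not the asker's data.

```r
set.seed(1)

# Hypothetical per-ad data: impressions n and one dummy-style attribute
n_ads <- 200
ads <- data.frame(
  n    = sample(5000:20000, n_ads, replace = TRUE),
  size = factor(sample(c("large", "small"), n_ads, replace = TRUE))
)

# Assumed true click probabilities: small ads are clicked less often
p_true <- plogis(-6 - 0.5 * (ads$size == "small"))
ads$x  <- rbinom(n_ads, size = ads$n, prob = p_true)

# The response is a two-column matrix: (clicks, non-clicks)
mod <- glm(cbind(x, n - x) ~ size, data = ads, family = quasibinomial())

print(summary(mod)$coefficients)  # estimates are on the log-odds scale
print(head(fitted(mod)))          # fitted CTR per ad
```

Because the response carries both clicks and non-clicks, ads with many impressions automatically get more weight than ads with few, which a regression on the raw ratio $x/n$ would not do.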

EDIT

Good that you have tried some of my suggestions. Here are some answers to your further edits:

  1. Look at the line of the output that reads "Dispersion parameter for quasibinomial family": does there seem to be a substantial reduction?

  2. Look at the "Deviance Residuals:" part of the output. Maybe most of the difference is in the extremes, but you can extract all of the residuals with resid(your_glm_object, type="deviance") and then plot the residuals of the models against each other, or compare their histograms.

  3. There is a version of AIC for quasi-models called QAIC (and similarly for BIC, I suppose). A small paper about this in R (by Ben Bolker) is here. QAIC is implemented in some R packages, which are listed there.

  4. Getting predictions from these models: use something like predict(your_glm_object, type="response", newdata=your_data_frame_with_new_data). For details see ?predict.glm.
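The four points above can be sketched on simulated data as follows. Everything here is an assumption for illustration: the data frame `ads`, the attributes `size` and `video`, and the helper `qaic()`, which computes QAIC by hand as $-2\log L/\hat{c} + 2k$ using the log-likelihood of the matching binomial fit and the dispersion $\hat{c}$ of the largest model. Counting one extra parameter for the dispersion is a convention, not the only option.

```r
set.seed(2)

# Hypothetical per-ad data (assumed, for illustration only)
n_ads <- 300
ads <- data.frame(
  n     = sample(5000:20000, n_ads, replace = TRUE),
  size  = factor(sample(c("large", "small"), n_ads, replace = TRUE)),
  video = rbinom(n_ads, 1, 0.5)
)
# Ad-level noise on the logit scale makes the clicks overdispersed
eta   <- -6 - 0.5 * (ads$size == "small") + 0.4 * ads$video +
  rnorm(n_ads, sd = 0.3)
ads$x <- rbinom(n_ads, size = ads$n, prob = plogis(eta))

m1 <- glm(cbind(x, n - x) ~ size,         data = ads, family = quasibinomial())
m2 <- glm(cbind(x, n - x) ~ size + video, data = ads, family = quasibinomial())

# 1. Dispersion parameter of each model
disp <- c(m1 = summary(m1)$dispersion, m2 = summary(m2)$dispersion)
print(disp)

# 2. Deviance residuals of the two models, compared numerically
r1 <- resid(m1, type = "deviance")
r2 <- resid(m2, type = "deviance")
print(summary(r1))
print(summary(r2))
# For the visual comparison: plot(r1, r2); abline(0, 1)

# 3. QAIC by hand (hypothetical helper, see the lead-in for the formula)
qaic <- function(quasi_fit, c_hat) {
  # Refit with the binomial family to obtain a log-likelihood
  bin_fit <- glm(formula(quasi_fit), data = quasi_fit$data, family = binomial())
  k <- length(coef(bin_fit)) + 1   # +1 for the estimated dispersion
  as.numeric(-2 * logLik(bin_fit)) / c_hat + 2 * k
}
c_hat <- summary(m2)$dispersion
qaic_values <- c(m1 = qaic(m1, c_hat), m2 = qaic(m2, c_hat))
print(qaic_values)

# 4. Predicted CTR for new ads (impressions are not needed for prediction)
new_ads <- data.frame(
  size  = factor(c("large", "small"), levels = levels(ads$size)),
  video = c(1, 0)
)
pred <- predict(m2, newdata = new_ads, type = "response")
print(pred)
```

The same $\hat{c}$ is used for every candidate model so that the QAIC values are comparable; a lower value indicates the preferred model.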

kjetil b halvorsen
  • Thank you for your answer! The quasibinomial regression does not return an AIC value, which makes it hard to determine what the optimal model is (which variables may be excluded). How can I solve this? Also, how should I interpret the coefficient estimates? Suppose size_small has a coefficient of -2.04: is it correct to say that the number of clicks decreases by approximately 2 if the ad is small compared to large? I think this is a wrong interpretation as it isn't a linear regression, but I don't know how to read it otherwise. – M09 Jan 07 '19 at 15:04
  • You could post some output from (one of) the analyses ... and see the links in my answer, which have some explanation of interpretation. Logistic regression models a probability, so you can compare the predicted probabilities from the model. Search this site for "interpreting logistic regression". – kjetil b halvorsen Jan 07 '19 at 16:16
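
To make the interpretation question in the comments concrete: under the logit link a coefficient is a change in log-odds, not in the number of clicks. A small sketch using the -2.04 quoted above; the baseline CTR of 1% for large ads is an assumed value, chosen only to show the effect on the probability scale.

```r
beta_small <- -2.04   # coefficient quoted in the comment above

# Odds ratio: odds of a click for a small ad relative to a large ad
odds_ratio <- exp(beta_small)

# The effect on the probability scale depends on the baseline (assumed: 1%)
ctr_large <- 0.01
ctr_small <- plogis(qlogis(ctr_large) + beta_small)

print(c(odds_ratio = odds_ratio, ctr_large = ctr_large, ctr_small = ctr_small))
```

Since exp(-2.04) is about 0.13, the odds of a click for a small ad are roughly 13% of those for a large ad, holding the other attributes fixed.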